(Python) Breaking an output text file into tokens

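Long story short: I have an output file from a system, split into tokens delimited by "| |;". I need to extract whatever sits between the pipes "|" and write it to another file.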

The output file looks like this:

|Operation_ID|,|Operation_Name|,|business_group_name|,|business_unit_name|,|Program_ID|,|Program_Name|,|Project_ID|,|Project_Name|,|Program_Type_Name|,|Program_Cost_Type_Name|,|Start_date|,|Estimated_End_Date|,|End_Date|,|SQA_Name|,|CMA_Name|,|SSE_Name|,|PMs|,|TLs|,|PortfolioManager|,|Finished|,|Research|,|SQA_ID|,|CMA_ID|,|SSE_ID|
|20|,|XXX|,|YYY|,|ZZZ|,|1|,|WWW|,|2163|,|QQQ|,||,||,|15/12/2008|,||,|22/01/2009|,||,||,||,|EEE EEE |,||,||,|True|,||,||,||,||
|22|,|XXX|,|YYY|,|ZZZ|,|3|,|WWW|,|2165|,|QQQ|,||,||,|01/01/2009|,||,|09/04/2010|,||,||,||,|EEE EEE EEE|,||,||,|True|,|False|,||,||,||
|20|,|XXX|,|YYY|,|ZZZ|,|10|,|WWW|,|2164|,|QQQ|,|Development|,|Direct|,|15/12/2008|,||,|26/02/2010|,||,||,||,|EEE |,|EEE EEE ; EEE EEE ; EEE EEE |,||,|True|,|False|,||,||,||
|22|,|XXX|,|YYY|,|ZZZ|,|3|,|WWW|,|2166|,|QQQ|,||,||,|15/12/2008|,||,|31/05/2010|,||,||,||,||,||,||,|True|,|False|,||,||,||
|20|,|XXX|,|YYY|,|ZZZ|,|10|,|WWW|,|2168|,|QQQ|,|Development|,|Direct|,|05/01/2009|,||,|20/05/2009|,||,||,||,|EEE EEE EEE|,|EEE EEE |,||,|True|,||,||,||,||
|20|,|XXX|,|YYY|,|ZZZ|,|1|,|WWW|,|2169|,|QQQ|,||,||,|13/01/2009|,||,|22/05/2009|,||,||,||,|EEE EEE EEE|,|EEE EEE EEE EEE|,||,|True|,||,||,||,||
|21|,|XXX|,|YYY|,|ZZZ|,|2|,|WWW|,|2174|,|QQQ|,||,||,|08/01/2009|,||,|20/04/2009|,||,||,||,|EEE EEE |,|EEE EEE|,||,|True|,||,||,||,||
|23|,|XXX|,|YYY|,|ZZZ|,|47|,|WWW|,|2176|,|QQQ|,|Internal|,|Indirect|,|21/01/2009|,||,|17/12/2010|,||,||,||,|EEE EEE; EEE EEE|,||,||,|True|,|True|,||,||,||
|20|,|XXX|,|YYY|,|ZZZ|,|1|,|WWW|,|2142|,|QQQ|,||,||,|21/10/2008|,||,|13/05/2009|,||,||,||,|EEE EEE |,||,||,|True|,||,||,||,||
|20|,|XXX|,|YYY|,|ZZZ|,|1|,|WWW|,|2147|,|QQQ|,||,||,|07/11/2008|,||,|26/11/2008|,||,||,||,|EEE EEE EEE EEE |,|EEE EEE |,||,|True|,||,||,||,||
|20|,|XXX|,|YYY|,|ZZZ|,|1|,|WWW|,|2148|,|QQQ|,||,||,|07/11/2008|,||,|09/04/2009|,||,||,||,||,||,||,|True|,||,||,||,||
|22|,|XXX|,|YYY|,|ZZZ|,|3|,|WWW|,|2149|,|QQQ|,||,||,|01/11/2008|,|31/01/2011|,|01/12/2010|,||,||,||,|EEE EEE ; EEE EEE|,|EEE EEE; EEE EEE|,||,|True|,|False|,||,||,||
|22|,|XXX|,|YYY|,|ZZZ|,|20|,|WWW|,|2150|,|QQQ|,|Development|,||,|31/10/2008|,|31/10/2010|,|29/10/2010|,||,||,||,|EEE EEE |,|EEE EEE |,||,|True|,|False|,||,||,||
|20|,|XXX|,|YYY|,|ZZZ|,|1|,|WWW|,|2152|,|QQQ|,||,||,|26/11/2008|,||,|03/07/2009|,||,||,||,|EEE EEE EEE ; EEE EEE EEE EEE |,|EEE EEE |,||,|True|,||,||,||,||
|22|,|XXX|,|YYY|,|ZZZ|,|3|,|WWW|,|2151|,|QQQ|,||,||,|01/11/2008|,||,|29/01/2009|,||,||,||,||,||,||,|True|,||,||,||,||
|23|,|XXX|,|YYY|,|ZZZ|,|47|,|WWW|,|2187|,|QQQ|,|Internal|,|Indirect|,|21/01/2009|,||,|03/12/2009|,||,||,||,|EEE EEE|,|EEE EEE EEE|,||,|True|,|True|,||,||,||
|23|,|XXX|,|YYY|,|ZZZ|,|47|,|WWW|,|2192|,|QQQ|,|Internal|,|Indirect|,|21/01/2009|,||,|11/01/2011|,||,||,||,|EEE EEE EEE; EEE EEE|,||,||,|True|,|True|,||,||,||
|20|,|XXX|,|YYY|,|ZZZ|,|1|,|WWW|,|2196|,|QQQ|,||,||,|23/01/2009|,||,|24/03/2010|,||,||,||,|EEE EEE |,||,||,|True|,|False|,||,||,||
|21|,|XXX|,|YYY|,|ZZZ|,|41|,|WWW|,|2231|,|QQQ|,|Research|,||,|21/05/2009|,||,|01/12/2009|,||,||,||,||,||,||,|True|,|False|,||,||,||
|21|,|XXX|,|YYY|,|ZZZ|,|41|,|WWW|,|2230|,|QQQ|,|Research|,||,|21/05/2009|,||,|30/11/2009|,||,||,||,||,||,||,|True|,|False|,||,||,||
|21|,|XXX|,|YYY|,|ZZZ|,|41|,|WWW|,|2232|,|QQQ|,|Research|,||,|21/05/2009|,||,|09/07/2010|,||,||,||,||,|EEE EEE EEE|,||,|True|,|True|,||,||,||
|24|,|XXX|,|YYY|,|ZZZ|,|44|,|WWW|,|2237|,|QQQ|,|Research|,|Indirect|,|21/05/2009|,||,|22/01/2010|,||,||,||,||,||,||,|True|,|False|,||,||,||
|21|,|XXX|,|YYY|,|ZZZ|,|41|,|WWW|,|2238|,|QQQ|,|Research|,||,|21/05/2009|,||,|25/02/2010|,||,||,||,||,||,||,|True|,|False|,||,||,||
|21|,|XXX|,|YYY|,|ZZZ|,|41|,|WWW|,|2239|,|QQQ|,|Research|,||,|21/05/2009|,||,|04/01/2011|,||,||,||,||,||,||,|True|,|True|,||,||,||
|21|,|XXX|,|YYY|,|ZZZ|,|41|,|WWW|,|2240|,|QQQ|,|Research|,||,|21/05/2009|,||,|05/01/2011|,||,||,||,||,||,||,|True|,|True|,||,||,||
|26|,|XXX|,|YYY|,|ZZZ|,|50|,|WWW|,|2242|,|QQQ|,|Internal|,|Indirect|,|21/05/2009|,||,|14/10/2010|,||,||,||,||,||,||,|True|,|True|,||,||,||
|22|,|XXX|,|YYY|,|ZZZ|,|3|,|WWW|,|2273|,|QQQ|,||,||,|25/05/2009|,||,|29/01/2010|,||,||,||,||,|EEE EEE|,||,|True|,|False|,||,||,||

I'm new to Python and to programming in general, so I tried to write the following algorithm:

# => Reads the file test.txt;
# => Scans character by character for '|' character;
# => If character '|' is found, skips to next character and add subsequent
# characters to a 'token' array, until next character is '|' again;
# => When next character is '|', add 'token' array to 'array_of_tokens';
# => Once END OF FILE arrives, writes 'array_of_tokens' to 'test_output.txt'
# file;


test_file = 'test.txt'
test_output = 'test_output.txt'
token = []
array_of_tokens = []
index = 0

# => Reads the file test.txt;
with open(test_file) as file:
    while True:
        # => Scans character by character for '|' character;
        character = file.read(1)
        # => If character '|' is found,
        if character == "|"
            # skips to next character
            character = next(character),
            # until next character is '|' again;
            while not character == '|'
                # add subsequent characters to a 'token' array
                token(index) = character
                index ++
                character = next(character)
            # => When next character is '|', add 'token' array to 'array_of_tokens';
            if next(character) == '|'
                array_of_tokens = token

        else if not character:
            break
        print "Read a character: ", character

# => Once END OF FILE arrives, writes 'array_of_tokens' to 'test_output.txt'
# file;
test_output.write(str(array_of_tokens))

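And it obviously doesn't work. The thing is, I'm not entirely sure what I should do now. I know the result I need (it is written in the comments), but I'm not sure how to make my code work. Can anyone help? Also, if you have any tips on where to look for advice or resources that would help me become a better programmer, a real programmer, I'd be very grateful!

Thanks in advance!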

Just use str.translate to remove the |, split on the commas, and filter out the empty strings (Python 2 syntax):

In [9]: s="|22|,|XXX|,|YYY|,|ZZZ|,|3|,|WWW|,|2273|,|QQQ|,||,||,|25/05/2009|,||,|29/01/2010|,||,||,||,||,|EEE EEE|,||,|True|,|False|,||,||,||"



In [10]: print(filter(None,s.translate(None,"|").split(",")))
['22', 'XXX', 'YYY', 'ZZZ', '3', 'WWW', '2273', 'QQQ', '25/05/2009', '29/01/2010', 'EEE EEE', 'True', 'False']

If you need the data to stay aligned with the columns, don't filter.
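To illustrate the unfiltered variant, here is a minimal Python 3 sketch. The helper name split_keep_empty and the shortened header and row are made up for the example, and it assumes no field contains a comma or begins or ends with a literal pipe:

```python
def split_keep_empty(line):
    """Split one line on commas and strip the surrounding pipes, keeping empty fields."""
    # Assumption: fields never contain "," and never start or end with a real "|".
    return [field.strip("|") for field in line.rstrip("\n").split(",")]


# Hypothetical, shortened header and data row in the same format as the sample file.
header = "|Operation_ID|,|Operation_Name|,|Program_Type_Name|"
row = "|20|,|XXX|,||"

# Because empty fields are kept, every value still lines up with its column name.
record = dict(zip(split_keep_empty(header), split_keep_empty(row)))
print(record)
```

Since the empty Program_Type_Name is preserved as an empty string, zipping the header with each row stays safe.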

So, depending on how you want the data written to the output file, something like the following is all you need:

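import csv

# Python 2: the csv module expects the output file to be opened in binary mode.
with open("test.txt") as f, open('test_output.txt', "wb") as out:
    wr = csv.writer(out, delimiter=",")
    for line in f:
        # Delete every "|", split on commas, and drop the empty strings.
        wr.writerow(filter(None, line.rstrip().translate(None, "|").split(",")))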

Your output will be:

Operation_ID,Operation_Name,business_group_name,business_unit_name,Program_ID,Program_Name,Project_ID,Project_Name,Program_Type_Name,Program_Cost_Type_Name,Start_date,Estimated_End_Date,End_Date,SQA_Name,CMA_Name,SSE_Name,PMs,TLs,PortfolioManager,Finished,Research,SQA_ID,CMA_ID,SSE_ID
20,XXX,YYY,ZZZ,1,WWW,2163,QQQ,15/12/2008,22/01/2009,EEE EEE ,True
22,XXX,YYY,ZZZ,3,WWW,2165,QQQ,01/01/2009,09/04/2010,EEE EEE EEE,True,False
20,XXX,YYY,ZZZ,10,WWW,2164,QQQ,Development,Direct,15/12/2008,26/02/2010,EEE ,EEE EEE ; EEE EEE ; EEE EEE ,True,False
22,XXX,YYY,ZZZ,3,WWW,2166,QQQ,15/12/2008,31/05/2010,True,False
20,XXX,YYY,ZZZ,10,WWW,2168,QQQ,Development,Direct,05/01/2009,20/05/2009,EEE EEE EEE,EEE EEE ,True
20,XXX,YYY,ZZZ,1,WWW,2169,QQQ,13/01/2009,22/05/2009,EEE EEE EEE,EEE EEE EEE EEE,True
 etc.................

As tdelaney mentioned in the comments, this does assume that none of the fields contains a pipe character itself.
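If the pipes in this file act as quote characters around each field, which is an assumption about the exporting system and not something the question confirms, the csv module can parse it directly. The sketch below (Python 3, with a hypothetical helper name and made-up sample line) also copes with commas inside a field. A literal pipe inside a field would only be read correctly if the exporter doubles it:

```python
import csv
import io


def read_pipe_quoted(stream):
    """Parse comma-separated rows whose fields are wrapped in '|' quote characters."""
    # Assumption: "|" plays the role that '"' plays in ordinary CSV files.
    return list(csv.reader(stream, delimiter=",", quotechar="|"))


# Made-up line in the same style as the sample data, with a comma inside one field.
sample = io.StringIO("|20|,|EEE, EEE|,||,|True|\n")
rows = read_pipe_quoted(sample)
print(rows)
```

Empty fields are kept here, so the columns stay aligned with the header row.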

For Python 3 we need to do a little more work, because str.translate behaves slightly differently. We need to create a translation table with str.maketrans:

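import csv

# newline="" lets the csv module control line endings itself (Python 3).
with open("test.txt") as f, open('test_output.txt', "w", newline="") as out:
    wr = csv.writer(out, delimiter=",")
    # The third argument lists the characters to delete, so every "|" is removed.
    table = str.maketrans("", "", "|")
    for line in f:
        wr.writerow(list(filter(None, line.rstrip().translate(table).split(","))))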

Another approach is to split on "|" only and filter out the commas and the empty strings:

with open("test.txt") as f, open('test_output.txt', "w") as out:
    wr = csv.writer(out, delimiter=",")
    for line in f:
        # Splitting on "|" yields the fields plus "," and "" fragments; drop those.
        wr.writerow(filter(lambda x: x not in {",", ""}, line.rstrip().split("|")))
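For comparison, the character-by-character idea described in the question's comments can be made to work with a simple inside/outside flag. This is only a sketch in Python 3, not the approach recommended above. The function name tokenize is made up, and it assumes pipes always come in matched pairs:

```python
def tokenize(text):
    """Collect the text found between each opening and closing '|'."""
    tokens = []
    current = []      # characters of the token being built
    inside = False    # True while we are between an opening and a closing pipe
    for ch in text:
        if ch == "|":
            if inside:
                # A closing pipe ends the current token (it may be empty).
                tokens.append("".join(current))
                current = []
            inside = not inside
        elif inside:
            current.append(ch)
        # Characters outside a pair of pipes (commas, newlines) are skipped.
    return tokens


print(tokenize("|20|,|XXX|,||,|True|"))
```

Unlike the filtered versions, this keeps empty tokens, and it leaves commas or semicolons inside a field untouched.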