File processing in Python

I am processing a text file with Python. I have a text file (ctl_Files.txt) with the following, or similar, content:

------------------------
Changeset: 143
User: Sarfaraz
Date: Tuesday, April 05, 2011 5:34:54 PM

Comment:
  Initial add, all objects.

Items:
  add $/Systems/DB/Expences/Loader
  add $/Systems/DB/Expences/Loader/AAA.txt
  add $/Systems/DB/Expences/Loader/BBB.txt
  add $/Systems/DB/Expences/Loader/CCC.txt  

Check-in Notes:
  Code Reviewer:
  Performance Reviewer:
  Reviewer:
  Security Reviewer:
------------------------
Changeset: 145
User: Sarfaraz
Date: Thursday, April 07, 2011 5:34:54 PM

Comment:
  edited objects.

Items:
  edit $/Systems/DB/Expences/Loader
  edit $/Systems/DB/Expences/Loader/AAA.txt
  edit $/Systems/DB/Expences/Loader/AAB.txt  

Check-in Notes:
  Code Reviewer:
  Performance Reviewer:
  Reviewer:
  Security Reviewer:
------------------------
Changeset: 147
User: Sarfaraz
Date: Wednesday, April 06, 2011 5:34:54 PM

Comment:
  Initial add, all objects.

Items:
  delete, source rename $/Systems/DB/Expences/Loader/AAA.txt;X34892
  rename                $/Systems/DB/Expences/Loader/AAC.txt.

Check-in Notes:
  Code Reviewer:
  Performance Reviewer:
  Reviewer:
  Security Reviewer:
------------------------

To process this file, I wrote the following code:

# Tags used for splitting the information

tag1 = 'Changeset:'
tag2 = 'User:'
tag3 = 'Date:'
tag4 = 'Comment:'
tag5 = 'Items:'
tag6 = 'Check-in Notes:'

# Open and read the input file.
# Raw strings (r"...") keep the backslashes in the Windows path from being read as escapes.
with open(r"C:\Users\md_sarfaraz\Desktop\ctl_Files.txt", "r") as myfile:
    val = myfile.read().replace('\n', ' ')

# Count the occurrences of one of the tags above
# (the count is the same for all tags)
occurence = val.count(tag1)

# Initialize the row variable
row = ""

# Loop once per changeset
for count in range(1, occurence + 1):
    row += ((val.split(tag1)[count].split(tag2)[0]).strip() + '|'
            + (val.split(tag2)[count].split(tag3)[0]).strip() + '|'
            + (val.split(tag3)[count].split(tag4)[0]).strip() + '|'
            + (val.split(tag4)[count].split(tag5)[0]).strip() + '|'
            + (val.split(tag5)[count].split(tag6)[0]).strip() + '\n')

# Open and write the output file
with open(r"C:\Users\md_sarfaraz\Desktop\processed_ctl_Files.txt", "w") as outfile:
    outfile.write(row)

This gives me the following result file (processed_ctl_Files.txt):

143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader   add $/Systems/DB/Expences/Loader/AAA.txt   add $/Systems/DB/Expences/Loader/BBB.txt   add $/Systems/DB/Expences/Loader/CCC.txt
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader   edit $/Systems/DB/Expences/Loader/AAA.txt   edit $/Systems/DB/Expences/Loader/AAB.txt
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|delete, source rename $/Systems/DB/Expences/Loader/AAA.txt;X34892   rename                $/Systems/DB/Expences/Loader/AAC.txt.

However, I want the result to look like this:

143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader   
                                                                          add $/Systems/DB/Expences/Loader/AAA.txt   
                                                                          add $/Systems/DB/Expences/Loader/BBB.txt   
                                                                          add $/Systems/DB/Expences/Loader/CCC.txt
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader   
                                                                 edit $/Systems/DB/Expences/Loader/AAA.txt   
                                                                 edit $/Systems/DB/Expences/Loader/AAB.txt
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|delete, source rename $/Systems/DB/Expences/Loader/AAA.txt;X34892
                                                                            rename                $/Systems/DB/Expences/Loader/AAC.txt.

It would be even better to get a result like this:

143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader   
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader/AAA.txt   
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader/BBB.txt   
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader/CCC.txt
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader   
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader/AAA.txt   
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader/AAB.txt
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|delete, source rename $/Systems/DB/Expences/Loader/AAA.txt;X34892
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|rename                $/Systems/DB/Expences/Loader/AAC.txt.

Please let me know how I can do this. I am also new to Python, so please bear with any bad or redundant code and help me improve it.

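I would start by extracting the values into variables. Then build a prefix from the first few tags. You can count the characters in the prefix and use that number for the padding. When you reach the items, append the first one to the prefix; every further item is appended to a padding string made of as many spaces as the prefix is long.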

# Keywords used in the "Items:" section
keywords = ['add', 'delete', 'edit', 'source', 'rename']

# This replaces the loop in the question's script;
# val, row and tag1..tag6 are defined there.
for cs in val.split(tag1)[1:]:
    changeset = cs.split(tag2)[0].strip()
    user = cs.split(tag2)[1].split(tag3)[0].strip()
    date = cs.split(tag3)[1].split(tag4)[0].strip()
    comment = cs.split(tag4)[1].split(tag5)[0].strip()
    items = cs.split(tag5)[1].split(tag6)[0].split()
    prefix = '{0}|{1}|{2}|{3}|'.format(changeset, user, date, comment)
    # padding with the same width as the prefix
    padding = ' ' * len(prefix)
    i = 0
    while i < len(items):
        # skip over the leading keywords; rstrip(',') also accepts "delete,"
        j = i
        while j < len(items) and items[j].rstrip(',') in keywords:
            j += 1
        # items[i:j] are the keywords, items[j] (if present) is the path
        words = ' '.join(items[i:j + 1])
        # the first item goes after the prefix, the others after the padding
        row += (prefix if i == 0 else padding) + words + '\n'
        i = j + 1  # move past the keywords and the path

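This seems to do what you asked for, but I am not sure it is the best solution. It might be better to process the file line by line and write the values directly to the output stream.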
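
The line-by-line idea could look roughly like the sketch below. It assumes every section follows the layout shown in the question (tag lines, a one-line comment, indented items, a dashed separator). The function name `write_rows` and the use of `io.StringIO` as a stand-in for the output file are my own choices for the demo, not part of the original script.

```python
import io

def write_rows(lines, out):
    """Stream changeset rows to `out`, one output line per item."""
    fields = {}          # Changeset / User / Date / Comment of the current section
    section = None       # the multi-line section we are inside, if any
    first_item = True
    for raw in lines:
        line = raw.strip()
        if line.startswith('-----'):          # separator: a new changeset begins
            fields, section, first_item = {}, None, True
        elif line in ('Comment:', 'Items:', 'Check-in Notes:'):
            section = line
        elif not line:
            continue                          # blank lines carry no data
        elif section == 'Comment:':
            fields['Comment'] = line          # assumption: one-line comments only
        elif section == 'Items:':
            prefix = '|'.join(fields.get(k, '') for k in
                              ('Changeset', 'User', 'Date', 'Comment')) + '|'
            # the first item follows the prefix, later items follow padding
            out.write((prefix if first_item else ' ' * len(prefix)) + line + '\n')
            first_item = False
        elif section is None and ':' in line:
            key, value = line.split(':', 1)   # split once so the time stays intact
            fields[key] = value.strip()

# Shortened sample in the format from the question
sample = """------------------------
Changeset: 143
User: Sarfaraz
Date: Tuesday, April 05, 2011 5:34:54 PM

Comment:
  Initial add, all objects.

Items:
  add $/Systems/DB/Expences/Loader
  add $/Systems/DB/Expences/Loader/AAA.txt

Check-in Notes:
  Code Reviewer:
------------------------
"""
out = io.StringIO()
write_rows(sample.splitlines(), out)
print(out.getvalue())
```

With real files, `lines` would be the open input file and `out` the open output file, so nothing but the current changeset header is held in memory.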

You can use a regular expression to search for 'add', 'edit', and so on.

import re

# Tags used for splitting the information
tag1 = 'Changeset:'
tag2 = 'User:'
tag3 = 'Date:'
tag4 = 'Comment:'
tag5 = 'Items:'
tag6 = 'Check-in Notes:'

# Open and read the input file, joining all lines with spaces
with open("wibble.txt", "r") as myfile:
    val = myfile.read().replace('\n', ' ')

# Count the occurrences of one of the tags above
# (the count is the same for all tags)
occurence = val.count(tag1)

# Initialize the row variable
row = ""

# Loop once per changeset
for count in range(1, occurence + 1):
    prefix = ((val.split(tag1)[count].split(tag2)[0]).strip() + '|'
              + (val.split(tag2)[count].split(tag3)[0]).strip() + '|'
              + (val.split(tag3)[count].split(tag4)[0]).strip() + '|'
              + (val.split(tag4)[count].split(tag5)[0]).strip() + '|')

    # Continuation lines are indented by the width of the prefix
    distance = len(prefix)
    items = val.split(tag5)[count].split(tag6)[0].strip()
    # Start a new, indented line before each keyword that follows two or more
    # whitespace characters; the lookahead (?=...) keeps the keyword in the text
    row += prefix + re.sub(r"\s\s+(?=add|edit|delete|rename)",
                           "\n" + " " * distance, items) + '\n'

# Open and write the output file
with open("wobble.txt", "w") as outfile:
    outfile.write(row)
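
One thing to watch when writing such a pattern: square brackets form a character class, so `[edit]` matches any single one of the letters e, d, i, t and not the word `edit`. Matching whole keywords needs an alternation such as `add|edit|delete|rename`, and wrapping it in a lookahead leaves the keyword in the text instead of replacing it. The snippet below is a small illustration on a made-up items string; `break_items` and `ITEM_BREAK` are hypothetical names used only for this demo.

```python
import re

# Two or more whitespace characters, followed (lookahead) by an action keyword.
ITEM_BREAK = re.compile(r"\s\s+(?=(?:add|edit|delete|rename)\b)")

def break_items(items, indent):
    """Put each item on its own line; continuation lines get `indent` spaces."""
    return ITEM_BREAK.sub("\n" + " " * indent, items.strip())

items = "  add $/Systems/DB/Expences/Loader   add $/Systems/DB/Expences/Loader/AAA.txt  "
print(break_items(items, 4))

# A character class matches single letters, not the word:
print(re.findall(r"[edit]", "edit"))   # ['e', 'd', 'i', 't']
```

Because the lookahead only fires in front of a keyword, the run of spaces inside `rename                $/...` is left untouched.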

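This solution is not as short as the answer that uses regular expressions, and it is probably not as efficient, but it should be easy to understand. It also makes the parsed data easier to work with, because every piece of data is stored in a dictionary.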

    ctl_file = "ctl_Files.txt" # path of source file
    processed_ctl_file = "processed_ctl_Files.txt" # path of destination file

    # Tags used for splitting the information
    changeset_tag = 'Changeset:'
    user_tag = 'User:'
    date_tag = 'Date:'
    comment_tag = 'Comment:'
    items_tag = 'Items:'
    checkin_tag = 'Check-in Notes:'

    section_separator = "------------------------"
    changesets = []

    # Open and read the input file
    with open(ctl_file, 'r') as read_file:
        first_section = True
        changeset_dict = {}
        items = []
        comment_stage = False
        items_stage = False
        # Read one line at a time
        for line in read_file:
            # Check which tag matches the current line and store the data under the matching key
            if changeset_tag in line:
                # split on the first colon only, so values that contain colons stay intact
                changeset_dict[changeset_tag] = line.split(":", 1)[1].strip()
            elif user_tag in line:
                changeset_dict[user_tag] = line.split(":", 1)[1].strip()
            elif date_tag in line:
                # the date holds a time such as 5:34:54, hence the maxsplit of 1
                changeset_dict[date_tag] = line.split(":", 1)[1].strip()
            elif comment_tag in line:
                comment_stage = True
            elif items_tag in line:
                items_stage = True
            elif checkin_tag in line:
                pass                        # not implemented because the example file contains no check-in data
            elif section_separator in line: # new section
                if first_section:
                    first_section = False
                    continue
                changesets.append(changeset_dict)
                changeset_dict = {}
                items = []
                # Set stages to false just in case
                items_stage = False
                comment_stage = False
            elif not line.strip():  # an empty line ends the current stage
                if items_stage:
                    changeset_dict[items_tag] = items
                    items_stage = False
                comment_stage = False
            else:
                if comment_stage:
                    changeset_dict[comment_tag] = line.strip()  # only works for a one-line comment
                elif items_stage:
                    items.append(line.strip())

    # Open and write to the output file
    with open(processed_ctl_file, 'w') as write_file:
        for changeset in changesets:
            row = "{0}|{1}|{2}|{3}|".format(changeset[changeset_tag], changeset[user_tag], changeset[date_tag], changeset[comment_tag])
            # continuation lines are indented by the width of the prefix
            distance = len(row)
            items = changeset[items_tag]
            join_string = "\n" + distance * " "
            row += join_string.join(items) + "\n"
            write_file.write(row)

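Also, try to use variable names that describe their contents. Names like tag1, tag2 and so on say little about what the variable holds. That makes the code hard to read, especially as the script grows. Readability may not seem important most of the time, but when you come back to old code, it takes much longer to work out what the code does if the variable names are not descriptive.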
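
For the last layout in the question, where the full prefix is repeated on every line, the padding step can simply be dropped. A rough sketch, assuming the changesets have already been parsed into dictionaries; the key names and the `rows_per_item` helper are illustrative and not taken from the scripts in this thread.

```python
def rows_per_item(changesets):
    """Yield one 'changeset|user|date|comment|item' row for every item."""
    for cs in changesets:
        prefix = "|".join([cs["changeset"], cs["user"], cs["date"], cs["comment"]])
        for item in cs["items"]:
            yield prefix + "|" + item

# Hand-written sample standing in for parsed data
changesets = [{
    "changeset": "145",
    "user": "Sarfaraz",
    "date": "Thursday, April 07, 2011 5:34:54 PM",
    "comment": "edited objects.",
    "items": ["edit $/Systems/DB/Expences/Loader",
              "edit $/Systems/DB/Expences/Loader/AAA.txt"],
}]

for row in rows_per_item(changesets):
    print(row)
```

Each printed row is self-contained, which tends to be easier to load into a spreadsheet or database than the padded layout.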