python 中的文件处理
file processing in python
我正在使用 Python 处理文本文件。
我有一个文本文件 (ctl_Files.txt),其中包含以下内容/或类似内容:
------------------------
Changeset: 143
User: Sarfaraz
Date: Tuesday, April 05, 2011 5:34:54 PM
Comment:
Initial add, all objects.
Items:
add $/Systems/DB/Expences/Loader
add $/Systems/DB/Expences/Loader/AAA.txt
add $/Systems/DB/Expences/Loader/BBB.txt
add $/Systems/DB/Expences/Loader/CCC.txt
Check-in Notes:
Code Reviewer:
Performance Reviewer:
Reviewer:
Security Reviewer:
------------------------
Changeset: 145
User: Sarfaraz
Date: Thursday, April 07, 2011 5:34:54 PM
Comment:
edited objects.
Items:
edit $/Systems/DB/Expences/Loader
edit $/Systems/DB/Expences/Loader/AAA.txt
edit $/Systems/DB/Expences/Loader/AAB.txt
Check-in Notes:
Code Reviewer:
Performance Reviewer:
Reviewer:
Security Reviewer:
------------------------
Changeset: 147
User: Sarfaraz
Date: Wednesday, April 06, 2011 5:34:54 PM
Comment:
Initial add, all objects.
Items:
delete, source rename $/Systems/DB/Expences/Loader/AAA.txt;X34892
rename $/Systems/DB/Expences/Loader/AAC.txt.
Check-in Notes:
Code Reviewer:
Performance Reviewer:
Reviewer:
Security Reviewer:
------------------------
为了处理这个文件,我编写了以下代码:
#Tags - used for spliting the information
tag1 = 'Changeset:'
tag2 = 'User:'
tag3 = 'Date:'
tag4 = 'Comment:'
tag5 = 'Items:'
tag6 = 'Check-in Notes:'
#opening and reading the input file
#In path to input file use '\' as escape character
with open ("C:\Users\md_sarfaraz\Desktop\ctl_Files.txt", "r") as myfile:
val=myfile.read().replace('\n', ' ')
#counting the occurence of any one of the above tag
#As count will be same for all the tags
occurence = val.count(tag1)
#initializing row variable
row=""
#passing the count - occurence to the loop
for count in range(1, occurence+1):
row += ( (val.split(tag1)[count].split(tag2)[0]).strip() + '|' \
+ (val.split(tag2)[count].split(tag3)[0]).strip() + '|' \
+ (val.split(tag3)[count].split(tag4)[0]).strip() + '|' \
+ (val.split(tag4)[count].split(tag5)[0]).strip() + '|' \
+ (val.split(tag5)[count].split(tag6)[0]).strip() + '\n')
#opening and writing the output file
#In path to output file use '\' as escape character
file = open("C:\Users\md_sarfaraz\Desktop\processed_ctl_Files.txt", "w+")
file.write(row)
file.close()
并得到以下 result/File (processed_ctl_Files.txt):
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader add $/Systems/DB/Expences/Loader/AAA.txt add $/Systems/DB/Expences/Loader/BBB.txt add $/Systems/DB/Expences/Loader/CCC.txt
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader edit $/Systems/DB/Expences/Loader/AAA.txt edit $/Systems/DB/Expences/Loader/AAB.txt
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|delete, source rename $/Systems/DB/Rascal/Expences/AAA.txt;X34892 rename $/Systems/DB/Rascal/Expences/AAC.txt.
但是,我想要这样的结果:
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader
add $/Systems/DB/Expences/Loader/AAA.txt
add $/Systems/DB/Expences/Loader/BBB.txt
add $/Systems/DB/Expences/Loader/CCC.txt
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader
edit $/Systems/DB/Expences/Loader/AAA.txt
edit $/Systems/DB/Expences/Loader/AAB.txt
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|delete, source rename $/Systems/DB/Rascal/Expences/AAA.txt;X34892
rename $/Systems/DB/Rascal/Expences/AAC.txt.
要是能得到这样的结果就好了:
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader/AAA.txt
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader/BBB.txt
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader/CCC.txt
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader/AAA.txt
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader/AAB.txt
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|delete, source rename $/Systems/DB/Rascal/Expences/AAA.txt;X34892
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|rename $/Systems/DB/Rascal/Expences/AAC.txt.
让我知道我该怎么做。另外,我是 Python 的新手,所以如果我写了一些糟糕或多余的代码,请忽略。并帮助我改进它。
我将从将值提取到变量中开始。然后从前几个标签创建一个前缀。您可以计算前缀中的字符数并将其用于填充。当您到达项目时,将第一个附加到前缀,任何其他项目都可以附加到根据您需要的空格数创建的填充。
# keywords used in the tag "Items: "
keywords = ['add', 'delete', 'edit', 'source', 'rename']
#passing the count - occurence to the loop
for cs in val.split(tag1)[1:]:
changeset = cs.split(tag2)[0].strip()
user = cs.split(tag2)[1].split(tag3)[0].strip()
date = cs.split(tag3)[1].split(tag4)[0].strip()
comment = cs.split(tag4)[1].split(tag5)[0].strip()
items = cs.split(tag5)[1].split(tag6)[0].strip().split()
notes = cs.split(tag6)
prefix = '{0}|{1}|{2}|{3}'.format(changeset, user, date, comment)
space_count = len(prefix)
i = 0
while i < len(items):
# if we are printing the first item, add it to the other text
if i == 0:
pref = prefix
# otherwise create padding from spaces
else:
pref = ' '*space_count
# add all keywords
words = ''
for j in range(i, len(items)):
if items[j] in keywords:
words += ' ' + items[j]
else:
break
if i >= len(items): break
row += '{0}|{1} {2}\n'.format(pref, words, items[j])
i += j - i + 1 # increase by the number of keywords + the param
这似乎可以满足您的要求,但我不确定这是否是最佳解决方案。也许逐行处理文件并将值直接打印到流中更好?
您可以使用正则表达式搜索'add'、'edit'等
import re
#Tags - used for spliting the information
tag1 = 'Changeset:'
tag2 = 'User:'
tag3 = 'Date:'
tag4 = 'Comment:'
tag5 = 'Items:'
tag6 = 'Check-in Notes:'
#opening and reading the input file
#In path to input file use '\' as escape character
with open ("wibble.txt", "r") as myfile:
val=myfile.read().replace('\n', ' ')
#counting the occurence of any one of the above tag
#As count will be same for all the tags
occurence = val.count(tag1)
#initializing row variable
row=""
prevlen = 0
#passing the count - occurence to the loop
for count in range(1, occurence+1):
row += ( (val.split(tag1)[count].split(tag2)[0]).strip() + '|' \
+ (val.split(tag2)[count].split(tag3)[0]).strip() + '|' \
+ (val.split(tag3)[count].split(tag4)[0]).strip() + '|' \
+ (val.split(tag4)[count].split(tag5)[0]).strip() + '|' )
distance = len(row) - prevlen
row += re.sub("\s\s+([edit]|[add]|[delete]|[rename])", r"\n"+r" "*distance+r"", (val.split(tag5)[count].split(tag6)[0])) + '\r'
prevlen = len(row)
#opening and writing the output file
#In path to output file use '\' as escape character
file = open("wobble.txt", "w+")
file.write(row)
file.close()
这个解决方案不像使用正则表达式的答案那么简短,可能也没有那么有效,但它应该很容易理解。该解决方案确实使解析数据的使用更容易,因为每个部分数据都存储在字典中。
ctl_file = "ctl_Files.txt" # path of source file
processed_ctl_file = "processed_ctl_Files.txt" # path of destination file
#Tags - used for spliting the information
changeset_tag = 'Changeset:'
user_tag = 'User:'
date_tag = 'Date:'
comment_tag = 'Comment:'
items_tag = 'Items:'
checkin_tag = 'Check-in Notes:'
section_separator = "------------------------"
changesets = []
#open and read the input file
with open(ctl_file, 'r') as read_file:
first_section = True
changeset_dict = {}
items = []
comment_stage = False
items_stage = False
checkin_dict = {}
# Read one line at a time
for line in read_file:
# Check which tag matches the current line and store the data to matching key in the dictionary
if changeset_tag in line:
changeset = line.split(":")[1].strip()
changeset_dict[changeset_tag] = changeset
elif user_tag in line:
user = line.split(":")[1].strip()
changeset_dict[user_tag] = user
elif date_tag in line:
date = line.split(":")[1].strip()
changeset_dict[date_tag] = date
elif comment_tag in line:
comment_stage = True
elif items_tag in line:
items_stage = True
elif checkin_tag in line:
pass # not implemented due to example file not containing any data
elif section_separator in line: # new section
if first_section:
first_section = False
continue
tmp = changeset_dict
changesets.append(tmp)
changeset_dict = {}
items = []
# Set stages to false just in case
items_stage = False
comment_stage = False
elif not line.strip(): # empty line
if items_stage:
changeset_dict[items_tag] = items
items_stage = False
comment_stage = False
else:
if comment_stage:
changeset_dict[comment_tag] = line.strip() # Only works for one line comment
elif items_stage:
items.append(line.strip())
#open and write to the output file
with open(processed_ctl_file, 'w') as write_file:
for changeset in changesets:
row = "{0}|{1}|{2}|{3}|".format(changeset[changeset_tag], changeset[user_tag], changeset[date_tag], changeset[comment_tag])
distance = len(row)
items = changeset[items_tag]
join_string = "\n" + distance * " "
items_part = str.join(join_string, items)
row += items_part + "\n"
write_file.write(row)
另外,尽量使用描述其内容的变量名。像tag1、tag2等名字,并没有过多的说明变量内容。这使得代码难以阅读,尤其是当脚本变长时。在大多数情况下,可读性似乎并不重要,但是当重新访问旧代码时,需要更长的时间才能理解代码对非描述变量的作用。
我正在使用 Python 处理文本文件。 我有一个文本文件 (ctl_Files.txt),其中包含以下内容/或类似内容:
------------------------
Changeset: 143
User: Sarfaraz
Date: Tuesday, April 05, 2011 5:34:54 PM
Comment:
Initial add, all objects.
Items:
add $/Systems/DB/Expences/Loader
add $/Systems/DB/Expences/Loader/AAA.txt
add $/Systems/DB/Expences/Loader/BBB.txt
add $/Systems/DB/Expences/Loader/CCC.txt
Check-in Notes:
Code Reviewer:
Performance Reviewer:
Reviewer:
Security Reviewer:
------------------------
Changeset: 145
User: Sarfaraz
Date: Thursday, April 07, 2011 5:34:54 PM
Comment:
edited objects.
Items:
edit $/Systems/DB/Expences/Loader
edit $/Systems/DB/Expences/Loader/AAA.txt
edit $/Systems/DB/Expences/Loader/AAB.txt
Check-in Notes:
Code Reviewer:
Performance Reviewer:
Reviewer:
Security Reviewer:
------------------------
Changeset: 147
User: Sarfaraz
Date: Wednesday, April 06, 2011 5:34:54 PM
Comment:
Initial add, all objects.
Items:
delete, source rename $/Systems/DB/Expences/Loader/AAA.txt;X34892
rename $/Systems/DB/Expences/Loader/AAC.txt.
Check-in Notes:
Code Reviewer:
Performance Reviewer:
Reviewer:
Security Reviewer:
------------------------
为了处理这个文件,我编写了以下代码:
#Tags - used for spliting the information
tag1 = 'Changeset:'
tag2 = 'User:'
tag3 = 'Date:'
tag4 = 'Comment:'
tag5 = 'Items:'
tag6 = 'Check-in Notes:'
#opening and reading the input file
#In path to input file use '\' as escape character
with open ("C:\Users\md_sarfaraz\Desktop\ctl_Files.txt", "r") as myfile:
val=myfile.read().replace('\n', ' ')
#counting the occurence of any one of the above tag
#As count will be same for all the tags
occurence = val.count(tag1)
#initializing row variable
row=""
#passing the count - occurence to the loop
for count in range(1, occurence+1):
row += ( (val.split(tag1)[count].split(tag2)[0]).strip() + '|' \
+ (val.split(tag2)[count].split(tag3)[0]).strip() + '|' \
+ (val.split(tag3)[count].split(tag4)[0]).strip() + '|' \
+ (val.split(tag4)[count].split(tag5)[0]).strip() + '|' \
+ (val.split(tag5)[count].split(tag6)[0]).strip() + '\n')
#opening and writing the output file
#In path to output file use '\' as escape character
file = open("C:\Users\md_sarfaraz\Desktop\processed_ctl_Files.txt", "w+")
file.write(row)
file.close()
并得到以下 result/File (processed_ctl_Files.txt):
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader add $/Systems/DB/Expences/Loader/AAA.txt add $/Systems/DB/Expences/Loader/BBB.txt add $/Systems/DB/Expences/Loader/CCC.txt
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader edit $/Systems/DB/Expences/Loader/AAA.txt edit $/Systems/DB/Expences/Loader/AAB.txt
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|delete, source rename $/Systems/DB/Rascal/Expences/AAA.txt;X34892 rename $/Systems/DB/Rascal/Expences/AAC.txt.
但是,我想要这样的结果:
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader
add $/Systems/DB/Expences/Loader/AAA.txt
add $/Systems/DB/Expences/Loader/BBB.txt
add $/Systems/DB/Expences/Loader/CCC.txt
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader
edit $/Systems/DB/Expences/Loader/AAA.txt
edit $/Systems/DB/Expences/Loader/AAB.txt
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|delete, source rename $/Systems/DB/Rascal/Expences/AAA.txt;X34892
rename $/Systems/DB/Rascal/Expences/AAC.txt.
要是能得到这样的结果就好了:
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader/AAA.txt
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader/BBB.txt
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader/CCC.txt
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader/AAA.txt
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader/AAB.txt
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|delete, source rename $/Systems/DB/Rascal/Expences/AAA.txt;X34892
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|rename $/Systems/DB/Rascal/Expences/AAC.txt.
让我知道我该怎么做。另外,我是 Python 的新手,所以如果我写了一些糟糕或多余的代码,请忽略。并帮助我改进它。
我将从将值提取到变量中开始。然后从前几个标签创建一个前缀。您可以计算前缀中的字符数并将其用于填充。当您到达项目时,将第一个附加到前缀,任何其他项目都可以附加到根据您需要的空格数创建的填充。
# keywords used in the tag "Items: "
keywords = ['add', 'delete', 'edit', 'source', 'rename']
#passing the count - occurence to the loop
for cs in val.split(tag1)[1:]:
changeset = cs.split(tag2)[0].strip()
user = cs.split(tag2)[1].split(tag3)[0].strip()
date = cs.split(tag3)[1].split(tag4)[0].strip()
comment = cs.split(tag4)[1].split(tag5)[0].strip()
items = cs.split(tag5)[1].split(tag6)[0].strip().split()
notes = cs.split(tag6)
prefix = '{0}|{1}|{2}|{3}'.format(changeset, user, date, comment)
space_count = len(prefix)
i = 0
while i < len(items):
# if we are printing the first item, add it to the other text
if i == 0:
pref = prefix
# otherwise create padding from spaces
else:
pref = ' '*space_count
# add all keywords
words = ''
for j in range(i, len(items)):
if items[j] in keywords:
words += ' ' + items[j]
else:
break
if i >= len(items): break
row += '{0}|{1} {2}\n'.format(pref, words, items[j])
i += j - i + 1 # increase by the number of keywords + the param
这似乎可以满足您的要求,但我不确定这是否是最佳解决方案。也许逐行处理文件并将值直接打印到流中更好?
您可以使用正则表达式搜索'add'、'edit'等
import re
#Tags - used for spliting the information
tag1 = 'Changeset:'
tag2 = 'User:'
tag3 = 'Date:'
tag4 = 'Comment:'
tag5 = 'Items:'
tag6 = 'Check-in Notes:'
#opening and reading the input file
#In path to input file use '\' as escape character
with open ("wibble.txt", "r") as myfile:
val=myfile.read().replace('\n', ' ')
#counting the occurence of any one of the above tag
#As count will be same for all the tags
occurence = val.count(tag1)
#initializing row variable
row=""
prevlen = 0
#passing the count - occurence to the loop
for count in range(1, occurence+1):
row += ( (val.split(tag1)[count].split(tag2)[0]).strip() + '|' \
+ (val.split(tag2)[count].split(tag3)[0]).strip() + '|' \
+ (val.split(tag3)[count].split(tag4)[0]).strip() + '|' \
+ (val.split(tag4)[count].split(tag5)[0]).strip() + '|' )
distance = len(row) - prevlen
row += re.sub("\s\s+([edit]|[add]|[delete]|[rename])", r"\n"+r" "*distance+r"", (val.split(tag5)[count].split(tag6)[0])) + '\r'
prevlen = len(row)
#opening and writing the output file
#In path to output file use '\' as escape character
file = open("wobble.txt", "w+")
file.write(row)
file.close()
这个解决方案不像使用正则表达式的答案那么简短,可能也没有那么有效,但它应该很容易理解。该解决方案确实使解析数据的使用更容易,因为每个部分数据都存储在字典中。
ctl_file = "ctl_Files.txt" # path of source file
processed_ctl_file = "processed_ctl_Files.txt" # path of destination file
#Tags - used for spliting the information
changeset_tag = 'Changeset:'
user_tag = 'User:'
date_tag = 'Date:'
comment_tag = 'Comment:'
items_tag = 'Items:'
checkin_tag = 'Check-in Notes:'
section_separator = "------------------------"
changesets = []
#open and read the input file
with open(ctl_file, 'r') as read_file:
first_section = True
changeset_dict = {}
items = []
comment_stage = False
items_stage = False
checkin_dict = {}
# Read one line at a time
for line in read_file:
# Check which tag matches the current line and store the data to matching key in the dictionary
if changeset_tag in line:
changeset = line.split(":")[1].strip()
changeset_dict[changeset_tag] = changeset
elif user_tag in line:
user = line.split(":")[1].strip()
changeset_dict[user_tag] = user
elif date_tag in line:
date = line.split(":")[1].strip()
changeset_dict[date_tag] = date
elif comment_tag in line:
comment_stage = True
elif items_tag in line:
items_stage = True
elif checkin_tag in line:
pass # not implemented due to example file not containing any data
elif section_separator in line: # new section
if first_section:
first_section = False
continue
tmp = changeset_dict
changesets.append(tmp)
changeset_dict = {}
items = []
# Set stages to false just in case
items_stage = False
comment_stage = False
elif not line.strip(): # empty line
if items_stage:
changeset_dict[items_tag] = items
items_stage = False
comment_stage = False
else:
if comment_stage:
changeset_dict[comment_tag] = line.strip() # Only works for one line comment
elif items_stage:
items.append(line.strip())
#open and write to the output file
with open(processed_ctl_file, 'w') as write_file:
for changeset in changesets:
row = "{0}|{1}|{2}|{3}|".format(changeset[changeset_tag], changeset[user_tag], changeset[date_tag], changeset[comment_tag])
distance = len(row)
items = changeset[items_tag]
join_string = "\n" + distance * " "
items_part = str.join(join_string, items)
row += items_part + "\n"
write_file.write(row)
另外,尽量使用描述其内容的变量名。像tag1、tag2等名字,并没有过多的说明变量内容。这使得代码难以阅读,尤其是当脚本变长时。在大多数情况下,可读性似乎并不重要,但是当重新访问旧代码时,需要更长的时间才能理解代码对非描述变量的作用。