Python: Replacing duplicate information in a text file
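I'm currently working on a program that writes every word of a text file into a spreadsheet with xlsxwriter, which means I have to split each line.
My problem is that I have to remove the repeated information in each line, up to the first element that differs. I can't figure out how to solve it.
Example text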
Dave likes fresh green apples
Dave likes fresh green peppers
Dave hates fresh green apples
Dave hates rotten green apples
Jane likes fresh green apples
Desired result in xlsxwriter
     C1    C2     C3      C4     C5
R1   Dave  likes  fresh   green  apples
R2   X     X      X       X      peppers
R3   X     hates  fresh   green  apples
R4   X     X      rotten  green  apples
R5   Jane  likes  fresh   green  apples
Thanks
Challenge accepted.
How about something like this:
test.txt
Dave likes fresh green apples
Dave likes fresh green peppers
Dave hates fresh green apples
Dave hates rotten green apples
Jane likes fresh green apples
Dave likes fresh green watermelon
Jane likes fresh green peppers
Here is my first idea (I got it working and recorded it in my original post)
def read_lines_with_duplicate_replace_v1(path, replace_char="X"):
    """Generator that reads the lines of the file at path and, for each
    line that starts like some previous line, replaces each leading part
    that matches with replace_char. Yields a list with the result."""
    # assume that each line has the same number of elements
    record = dict()  # first word -> rest of the last line that started with it
    with open(path) as file:
        for line in file:
            result = line.split()
            temp = tuple(result)  # untouched copy of the current line
            if temp[0] in record:
                key = result[0]
                result[0] = replace_char
                # mask words until the first one that differs
                for i in range(1, len(result)):
                    if result[i] == record[key][i-1]:
                        result[i] = replace_char
                    else:
                        break
            record[temp[0]] = temp[1:]
            yield result
Here is the second idea, which only remembers the previous line
def read_lines_with_duplicate_replace_v2(path, replace_char="X"):
    """Generator that reads the lines of the file at path and, for each
    line that starts like the previous line, replaces each leading part
    that matches with replace_char. Yields a list with the result."""
    # assume that each line has the same number of elements
    num_elem = 0
    previous_line = list()
    with open(path) as file:
        for line in file:
            result = line.split()
            if previous_line:
                for i in range(num_elem):
                    if result[i] == previous_line[i]:
                        result[i] = replace_char
                    else:
                        # first difference: remember the new tail and stop;
                        # updating here keeps previous_line intact when a
                        # line repeats the previous one exactly
                        previous_line[i:] = result[i:]
                        break
            else:
                previous_line.extend(result)
                num_elem = len(previous_line)
            yield result
Output:
>>> for x in read_lines_with_duplicate_replace_v1("test.txt"):
...     print(*x)
Dave likes fresh green apples
X X X X peppers
X hates fresh green apples
X X rotten green apples
Jane likes fresh green apples
X likes fresh green watermelon
X X X X peppers
>>>
>>>
>>> for x in read_lines_with_duplicate_replace_v2("test.txt"):
...     print(*x)
Dave likes fresh green apples
X X X X peppers
X hates fresh green apples
X X rotten green apples
Jane likes fresh green apples
Dave likes fresh green watermelon
Jane likes fresh green peppers
>>>
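To get the result into a spreadsheet, which is what the question asks for, something along these lines might work. It is only a sketch: mask_repeated_prefix and write_rows_to_xlsx are made-up names, the first is a variant of the second idea that takes already-split rows and tolerates rows of different lengths, and the second assumes the third-party xlsxwriter package is installed.

```python
def mask_repeated_prefix(rows, replace_char="X"):
    """Yield each row with the leading words it shares with the
    previous row replaced by replace_char (hypothetical helper)."""
    previous = []
    for row in rows:
        row = list(row)
        masked = list(row)
        # zip stops at the shorter row, so rows may differ in length
        for i, (old, new) in enumerate(zip(previous, row)):
            if old != new:
                break
            masked[i] = replace_char
        previous = row  # always compare against the unmasked line
        yield masked


def write_rows_to_xlsx(rows, xlsx_path):
    """Write one row per worksheet row, one word per cell
    (hypothetical helper; requires xlsxwriter)."""
    import xlsxwriter  # imported here so the helper above works without it

    workbook = xlsxwriter.Workbook(xlsx_path)
    worksheet = workbook.add_worksheet()
    for row_index, row in enumerate(rows):
        # write_row fills consecutive cells starting at column 0
        worksheet.write_row(row_index, 0, row)
    workbook.close()
```

Either generator above can feed the writer directly, for example write_rows_to_xlsx(read_lines_with_duplicate_replace_v2("test.txt"), "result.xlsx"), where result.xlsx is just a placeholder file name.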