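Take words from one file, find them in a second file, and if found write the whole line to a third file using Python
I want to compare two files. One contains a list of indexes; the second contains the same indexes together with their content. (See my sample files and the desired output below.) I want to write code that checks, line by line, whether each word from the first file appears in the second file and, if it is found, writes the whole line to a third file.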
File one (contains only indexes):
CHEA B C13279
CHEA B C13281
CHEA B C13305
CHEA B C14782
CHEA B C15292
CHEA B C15296
CHEA B C15298
CHEA B C15324
CHEA B C15406
CHEA B C15409
File two (contains indexes and content):
('CHEA B C13279', 'CHE', 'CHK', '0', 0),
('CHEA B C13281', 'CHE', 'CHK', '0', 0),
('CHEA B C13305', 'CHE', 'CEM', '491', 0),
('CHEA B C14782', 'PHY', 'EI', '17/15', 0),
('CHEA B C15292', 'CHE', 'IEM', '767', 0),
('CHEA B C15296', 'CHE', 'IEM', '767', 0),
('CHEA B C15298', 'CHE', 'IEM', '767', 0),
('CHEA B C15324', 'CHE', 'IEM', '767', 0),
('CHEA B C15406', 'CHE', 'IEM', '769', 0),
('CHEA B C15409', 'CHE', 'IEM', '769', 0),
('CHEA B C15568', 'CHE', 'Elo', '3', 0),
('CHEA B C15571', 'CHE', 'Elo', '0234', 0),
('CHEA B C15575', 'CHE', 'Elo', '0526', 0),
('CHEA B C15577', 'CHE', 'Elo', '260', 0),
('CHEA B C15583', 'CHE', 'Elo','340', 0),
('CHEA B C15587', 'CHE', 'Elo','63', 0),
('CHEA B C15590', 'CHE', 'Elo','325', 0),
('CHEA B C15592', 'CHE', 'Elo','066', 0),
('CHEA B C15599', 'CHE', 'Elo','536', 0);
My code:
def findLineByIndex():
    count = 0
    readindexFile = open(indexFilename, 'r')
    readdataFile = open(datafile, 'r').readlines()
    for index in readindexFile:
        if index[2:index.find(2:index[2:].find('\'')+2)] in readdataFile:
            count += 1
            print line
    print count, ""

if __name__ == "__main__":
    findLineByIndex()
As a result, I expect the third file to contain:
('CHEA B C13279', 'CHE', 'CHK', '0', 0),
('CHEA B C13281', 'CHE', 'CHK', '0', 0),
('CHEA B C13305', 'CHE', 'CEM', '491', 0),
('CHEA B C14782', 'PHY', 'EI', '17/15', 0),
('CHEA B C15292', 'CHE', 'IEM', '767', 0),
('CHEA B C15296', 'CHE', 'IEM', '767', 0),
('CHEA B C15298', 'CHE', 'IEM', '767', 0),
('CHEA B C15324', 'CHE', 'IEM', '767', 0),
('CHEA B C15406', 'CHE', 'IEM', '769', 0),
('CHEA B C15409', 'CHE', 'IEM', '769', 0),
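Here is a simple solution. It assumes the files are named file1, file2 and file3.
with open('file1') as f:
    file1_text = f.readlines()
with open('file2') as f:
    file2_text = f.readlines()
with open('file3', 'w') as out:
    for line1 in file1_text:
        # Drop the trailing newline so the index can match inside a line.
        line1 = line1.rstrip()
        for line2 in file2_text:
            # Substring test: the index appears somewhere in the data line.
            if line1 in line2:
                out.write(line2)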
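The same nested-loop idea can be wrapped in a reusable function. This is only a sketch: the function name find_lines_by_index and its three path parameters are assumptions made for illustration, and it adds a guard for blank index lines, since an empty string is a substring of every line.

```python
def find_lines_by_index(index_path, data_path, output_path):
    """Copy each line of data_path that contains an index listed in index_path.

    The three parameter names are hypothetical. Returns the number of lines written.
    """
    with open(data_path) as data_file:
        data_lines = data_file.readlines()

    count = 0
    with open(index_path) as index_file, open(output_path, "w") as out:
        for index in index_file:
            index = index.strip()  # remove the trailing newline before comparing
            if not index:
                continue  # skip blank lines; "" would match every data line
            for entry in data_lines:
                if index in entry:
                    out.write(entry)
                    count += 1
    return count
```

Calling find_lines_by_index('file1', 'file2', 'file3') would then write the matching lines and return how many were written.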
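Because readlines() returns a list, testing "index in readdataFile" only succeeds when the index equals an entire line. You need to loop over all the entries and check whether the index is contained in each one. Strip the trailing newline from the index first, or the substring test will fail.
# outputFilename is an assumed name for the third file; it is not in the question.
with open(outputFilename, 'w') as out:
    for index in readindexFile:
        index = index.strip()
        for entry in readdataFile:
            if index in entry:
                count += 1
                out.write(entry)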
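Here is an alternative that runs in roughly O(n + m) time for files of n and m lines, because each dictionary lookup takes constant time on average. The nested-loop solutions compare every index against every line, which is O(n * m); you will notice the difference on large inputs. Note that this version matches the index exactly against the first field of each line rather than as a substring.
with open('file2') as file2:
    # Map the first field of each line (quotes and parenthesis removed) to the line.
    file2_dict = {line.split(',')[0].strip("'(") : line for line in file2}
with open('file1') as file1, open('output', 'w') as output:
    for key in file1:
        file2_line = file2_dict.get(key.strip())
        if file2_line is not None:
            output.write(file2_line)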
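First, build a dictionary from the contents of file2. Each key is the first element of the tuple (which corresponds to a value in file1), and each value is the line itself.
Then iterate over the keys in file1 and look each one up in the dictionary. If a key is found, its line is written to the output file. The result will be: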
('CHEA B C13279', 'CHE', 'CHK', '0', 0),
('CHEA B C13281', 'CHE', 'CHK', '0', 0),
('CHEA B C13305', 'CHE', 'CEM', '491', 0),
('CHEA B C14782', 'PHY', 'EI', '17/15', 0),
('CHEA B C15292', 'CHE', 'IEM', '767', 0),
('CHEA B C15296', 'CHE', 'IEM', '767', 0),
('CHEA B C15298', 'CHE', 'IEM', '767', 0),
('CHEA B C15324', 'CHE', 'IEM', '767', 0),
('CHEA B C15406', 'CHE', 'IEM', '769', 0),
('CHEA B C15409', 'CHE', 'IEM', '769', 0),
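To see how the dictionary key is derived, the extraction can be traced on a single line taken from the sample data. This is a small illustration only; it relies on the index itself containing no comma, which holds for the sample files but is an assumption about the real data.

```python
line = "('CHEA B C13305', 'CHE', 'CEM', '491', 0),\n"

# Everything before the first comma: the opening parenthesis plus the quoted index.
first_field = line.split(',')[0]

# strip("'(") removes any leading or trailing quote and parenthesis characters.
key = first_field.strip("'(")

print(repr(first_field))  # "('CHEA B C13305'"
print(key)                # CHEA B C13305
```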
The approach above assumes that each key appears on only one line. If the same key can appear on several lines, you can use a defaultdict of lists:
from collections import defaultdict
file2_dict = defaultdict(list)
with open('file2') as file2:
    for line in file2:
        key = line.split(',')[0].strip("'(")
        file2_dict[key].append(line)
with open('file1') as file1, open('output', 'w') as output:
    for key in file1:
        # .get() avoids creating an empty list for keys that are not in file2.
        file2_lines = file2_dict.get(key.strip())
        if file2_lines is not None:
            output.writelines(file2_lines)
Now each key maps to a list of matching lines, and all of them are written to the output.
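The grouping behaviour can be checked without touching any files. In the sketch below the data lives in plain lists; the second data line and the index CHEA B C99999 are made up for illustration and do not come from the sample files.

```python
from collections import defaultdict

data_lines = [
    "('CHEA B C13279', 'CHE', 'CHK', '0', 0),\n",
    "('CHEA B C13279', 'CHE', 'CHK', '1', 0),\n",  # hypothetical second line for the same key
    "('CHEA B C13281', 'CHE', 'CHK', '0', 0),\n",
]
wanted = ["CHEA B C13279\n", "CHEA B C99999\n"]  # the second index is invented and absent

# Group every data line under its index.
lines_by_key = defaultdict(list)
for line in data_lines:
    key = line.split(',')[0].strip("'(")
    lines_by_key[key].append(line)

# Collect all lines for each wanted index; missing indexes contribute nothing.
matches = []
for key in wanted:
    matches.extend(lines_by_key.get(key.strip(), []))

print(len(matches))  # 2
```

Using .get() rather than lines_by_key[key] for the lookup matters with a defaultdict: indexing would silently insert an empty list for every index that is not present.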