Take words from one file, find them in a second file, and if found write the whole line to a third file using Python

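I want to compare two files: the first contains a list of indexes, and the second contains those indexes together with their content (see my sample files and the desired output below). I want to write code that takes each index from the first file, checks the second file line by line for it, and, if it is found, writes the whole matching line to a third file.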


File one (contains only the indexes):

CHEA B C13279
CHEA B C13281
CHEA B C13305
CHEA B C14782
CHEA B C15292
CHEA B C15296
CHEA B C15298
CHEA B C15324
CHEA B C15406
CHEA B C15409


File two (contains the indexes and their content):

('CHEA B C13279', 'CHE', 'CHK', '0', 0),
('CHEA B C13281', 'CHE', 'CHK', '0', 0),
('CHEA B C13305', 'CHE', 'CEM', '491', 0),
('CHEA B C14782', 'PHY', 'EI', '17/15', 0),
('CHEA B C15292', 'CHE', 'IEM', '767', 0),
('CHEA B C15296', 'CHE', 'IEM', '767', 0),
('CHEA B C15298', 'CHE', 'IEM', '767', 0),
('CHEA B C15324', 'CHE', 'IEM', '767', 0),
('CHEA B C15406', 'CHE', 'IEM', '769', 0),
('CHEA B C15409', 'CHE', 'IEM', '769', 0),
('CHEA B C15568', 'CHE', 'Elo', '3', 0),
('CHEA B C15571', 'CHE', 'Elo', '0234', 0),
('CHEA B C15575', 'CHE', 'Elo', '0526', 0),
('CHEA B C15577', 'CHE', 'Elo', '260', 0),
('CHEA B C15583', 'CHE', 'Elo','340', 0),
('CHEA B C15587', 'CHE', 'Elo','63', 0),
('CHEA B C15590', 'CHE', 'Elo','325', 0),
('CHEA B C15592', 'CHE', 'Elo','066', 0),
('CHEA B C15599', 'CHE', 'Elo','536', 0);


My code:

def findLineByIndex():
    count = 0

    readindexFile = open(indexFilename, 'r')
    readdataFile = open(datafile, 'r').readlines()

    for index in readindexFile:
        if index[2:index.find(2:index[2:].find('\'')+2)] in readdataFile:
            count += 1
            print line

    print count, ""

if __name__ == "__main__":
    findLineByIndex()


The result I want in the third file:

('CHEA B C13279', 'CHE', 'CHK', '0', 0),
('CHEA B C13281', 'CHE', 'CHK', '0', 0),
('CHEA B C13305', 'CHE', 'CEM', '491', 0),
('CHEA B C14782', 'PHY', 'EI', '17/15', 0),
('CHEA B C15292', 'CHE', 'IEM', '767', 0),
('CHEA B C15296', 'CHE', 'IEM', '767', 0),
('CHEA B C15298', 'CHE', 'IEM', '767', 0),
('CHEA B C15324', 'CHE', 'IEM', '767', 0),
('CHEA B C15406', 'CHE', 'IEM', '769', 0),
('CHEA B C15409', 'CHE', 'IEM', '769', 0),

Here is a simple solution. It assumes the files are named file1, file2, and file3.

with open('file1') as f:
    file1_text = f.readlines()
with open('file2') as f:
    file2_text = f.readlines()
with open('file3', 'w') as out:
    for line1 in file1_text:
        line1 = line1.rstrip()
        for line2 in file2_text:
            if line1 in line2:
                out.write(line2)
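
One edge case in the nested-loop approach is worth guarding against: the empty string is a substring of every string, so a blank line in file1 would copy every line of file2 to the output. The sketch below is a possible in-memory variant that skips blank index lines. The function name `matching_lines` and the sample lists are hypothetical, and it is written for Python 3.

```python
def matching_lines(index_lines, data_lines):
    """Return the data lines that contain any non-empty index as a substring."""
    matches = []
    for index in index_lines:
        index = index.strip()
        if not index:
            # '' is a substring of every string, so a blank index line
            # would otherwise match every data line.
            continue
        for data_line in data_lines:
            if index in data_line:
                matches.append(data_line)
    return matches


# Sample data taken from the question, plus one blank index line.
index_lines = ["CHEA B C13279\n", "\n", "CHEA B C15296\n"]
data_lines = [
    "('CHEA B C13279', 'CHE', 'CHK', '0', 0),\n",
    "('CHEA B C15296', 'CHE', 'IEM', '767', 0),\n",
    "('CHEA B C15568', 'CHE', 'Elo', '3', 0),\n",
]
result = matching_lines(index_lines, data_lines)
print("".join(result), end="")
```

With real files, the two lists could come from `readlines()` as in the code above, and the result could be written with `out.writelines(result)`.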

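Because readlines() returns a list of lines, testing a string with `in` against that list only matches a whole line exactly. You need to iterate over the entries and check whether the index occurs inside each one.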

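for index in readindexFile:
    index = index.strip()  # drop the trailing newline so the substring test can match
    for entry in readdataFile:
        if index in entry:
            count += 1
            outputFile.write(entry)  # outputFile: an already-open output file (assumed name)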
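
A detail that is easy to miss here: iterating over a file yields each line with its trailing newline still attached, so a raw index line is not a substring of the data line. A minimal Python 3 check, using one line from each sample file in the question (the variable names are only for illustration):

```python
entry = "('CHEA B C13279', 'CHE', 'CHK', '0', 0),\n"
raw_index = "CHEA B C13279\n"  # as yielded when iterating over the index file

# In the data line the index is followed by a quote, not a newline.
found_raw = raw_index in entry
found_stripped = raw_index.strip() in entry

print(found_raw, found_stripped)  # prints: False True
```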

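Here is an alternative that runs in linear time, O(n + m) for files of n and m lines, because each dictionary lookup takes constant time on average. The nested-loop solutions are O(n * m), which is roughly O(n^2) when the files are of similar size; you will notice the difference on large inputs.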

with open('file2') as file2:
    file2_dict = {line.split(',')[0].strip("'(") : line for line in file2}

with open('file1') as file1, open('output', 'w') as output:
    for key in file1:
        file2_line = file2_dict.get(key.strip())
        if file2_line is not None:
            output.write(file2_line)

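First, a dictionary is built from the contents of file2. Each key is the first element of the tuple on a line (which corresponds to a value in file1), and the value is the line itself.

Then iterate over the keys in file1 and look each one up in the dictionary. If a key is found, its line is written to the output file. The result will be: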

('CHEA B C13279', 'CHE', 'CHK', '0', 0),
('CHEA B C13281', 'CHE', 'CHK', '0', 0),
('CHEA B C13305', 'CHE', 'CEM', '491', 0),
('CHEA B C14782', 'PHY', 'EI', '17/15', 0),
('CHEA B C15292', 'CHE', 'IEM', '767', 0),
('CHEA B C15296', 'CHE', 'IEM', '767', 0),
('CHEA B C15298', 'CHE', 'IEM', '767', 0),
('CHEA B C15324', 'CHE', 'IEM', '767', 0),
('CHEA B C15406', 'CHE', 'IEM', '769', 0),
('CHEA B C15409', 'CHE', 'IEM', '769', 0),
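
To see what the key expression does, here is a small Python 3 illustration on one line from file2. It only splits the expression into two steps; the intermediate name `first_field` is introduced for the example.

```python
line = "('CHEA B C13279', 'CHE', 'CHK', '0', 0),\n"

# Everything before the first comma: the opening parenthesis and the quoted index.
first_field = line.split(',')[0]

# strip("'(") removes any leading or trailing quote and parenthesis characters.
key = first_field.strip("'(")

print(repr(first_field))  # prints: "('CHEA B C13279'"
print(repr(key))          # prints: 'CHEA B C13279'
```

This relies on the index itself containing no comma, which holds for the sample data shown.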

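The approach above assumes that each key appears on only one line of file2; if a key repeats, only its last line is kept. If the same key can appear on several lines, you can use a defaultdict of lists: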

from collections import defaultdict

file2_dict = defaultdict(list)

with open('file2') as file2:
    for line in file2:
        key = line.split(',')[0].strip("'(")
        file2_dict[key].append(line)

with open('file1') as file1, open('output', 'w') as output:
    for key in file1:
        file2_lines = file2_dict.get(key.strip())
        if file2_lines is not None:
            output.writelines(file2_lines)

Now each key maps to a list of matching lines, and all of them are written to the output.
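
A small in-memory sketch of the grouping, written for Python 3. The second data line is a made-up duplicate added only to show a repeated key, and `CHEA B C99999` is a hypothetical key that is absent.

```python
from collections import defaultdict

data_lines = [
    "('CHEA B C13279', 'CHE', 'CHK', '0', 0),\n",
    "('CHEA B C13279', 'PHY', 'EI', '17/15', 0),\n",  # made-up duplicate key
    "('CHEA B C13281', 'CHE', 'CHK', '0', 0),\n",
]

lines_by_key = defaultdict(list)
for line in data_lines:
    key = line.split(',')[0].strip("'(")
    lines_by_key[key].append(line)

# .get() returns None for a missing key and, unlike lines_by_key[key],
# does not create an empty list entry for it.
hits = lines_by_key.get("CHEA B C13279")
misses = lines_by_key.get("CHEA B C99999")  # hypothetical absent key

print(len(hits), misses)  # prints: 2 None
```

Using `.get()` rather than indexing keeps the dictionary from growing by one empty list for every index in file1 that has no match.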