Python: comparing text files for similar or equal lines

Question

I have two text files. My goal is to find the lines in First.txt that are not in Second.txt and write them to a third text file, Missing.txt. This is what I have so far:

fn = "Missing.txt"
# Load the stripped lines of Second.txt into a set for fast membership tests.
with open('Second.txt', 'r', encoding='utf-8', errors='ignore') as fileSecondary:
    bLines = {thing.strip() for thing in fileSecondary}
# Opening with 'w' already truncates the file, so no try/except or truncate() is needed.
with open('First.txt', 'r', encoding='utf-8', errors='ignore') as filePrimary, \
        open(fn, 'w', encoding='utf-8') as fileOutPut:
    for line in filePrimary:
        line = line.strip()
        if line not in bLines:
            fileOutPut.write(line + '\n')

But after running the script I ran into a problem: some lines are very similar without being identical. For example:

[PR] Zero One Two Three ft Four

and (no space after the bracket)

[PR]Zero One Two Three ft Four

or

[PR] Zero One Two Three ft Four

and (capital letter F)

[PR] Zero One Two Three Ft Four

I found SequenceMatcher, which does what I need, but how do I work it into the comparison? I am not comparing two strings, but one string against a set.

Answer 1

IIUC, you want lines to match even when their whitespace or letter case differs.

A simple approach is to strip all whitespace and lowercase everything as you read it:

import re

def format_line(line):
    # Remove all whitespace and lowercase, so spacing and case differences are ignored.
    return re.sub(r"\s+", "", line).lower()

with open('Second.txt', 'r', encoding='utf-8', errors='ignore') as fileSecondary:
    bLines = {format_line(thing) for thing in fileSecondary}

with open('First.txt', 'r', encoding='utf-8', errors='ignore') as filePrimary, \
        open('Missing.txt', 'w', encoding='utf-8') as fileOutPut:
    for line in filePrimary:
        # Compare the normalized form, but write the original line.
        if format_line(line) not in bLines:
            fileOutPut.write(line.rstrip('\n') + '\n')

Update 1: fuzzy matching

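If you want fuzzy matching, you could use something like nltk.metrics.distance.edit_distance, but then you have to compare every line against every other line (worst case). You lose the speed of the in operation.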

For example:

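from nltk.metrics.distance import edit_distance as dist

threshold = 3  # lines fewer than this many edits apart count as a match

# Drop-in replacement for the loop above; format_line, bLines, filePrimary
# and fileOutPut are as defined earlier.
for line in filePrimary:
    fline = format_line(line)
    # Worst case: this line is compared with every line of Second.txt.
    match_found = any(dist(fline, other_line) < threshold for other_line in bLines)

    if not match_found:
        fileOutPut.write(line.rstrip('\n') + '\n')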
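
Since the question asks about SequenceMatcher specifically, here is a rough sketch of the same idea using difflib from the standard library instead of nltk. The helper names (normalize, is_similar, missing_lines) and the 0.9 cutoff are my own assumptions for illustration, not something from the question; tune the cutoff against your data. It tries the exact set lookup first and only falls back to the slow pairwise comparison when that fails.

```python
from difflib import SequenceMatcher


def normalize(line):
    # Same idea as format_line: drop all whitespace and lowercase.
    return "".join(line.split()).lower()


def is_similar(candidate, known_lines, cutoff=0.9):
    """Return True if candidate equals or closely resembles any known line.

    cutoff is an assumed similarity ratio in [0, 1], not a recommended value.
    """
    if candidate in known_lines:  # fast path: exact match via set lookup
        return True
    matcher = SequenceMatcher(autojunk=False)
    # SequenceMatcher caches details about the second sequence,
    # so set it once and vary the first sequence.
    matcher.set_seq2(candidate)
    for other in known_lines:
        matcher.set_seq1(other)
        if matcher.ratio() >= cutoff:
            return True
    return False


def missing_lines(primary, secondary, cutoff=0.9):
    """Return the lines of primary with no similar line in secondary."""
    known = {normalize(s) for s in secondary}
    return [
        p.rstrip("\n")
        for p in primary
        if p.strip() and not is_similar(normalize(p), known, cutoff)
    ]
```

You would pass the two open file objects (or lists of lines) to missing_lines and write the result to Missing.txt. The fallback loop is still one comparison per line of Second.txt in the worst case, so expect it to be slow on large files.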
