Python: Comparing text files for similar or equal lines
I have 2 text files. My goal is to find the lines in First.txt that are not in Second.txt and write those lines to a third text file, Missing.txt. I have that working with:
fn = "Missing.txt"
# Mode 'w' creates the file or empties an existing one, so no retry or truncate() is needed
fileOutPut = open(fn, 'w', encoding='utf-8')
filePrimary = open('First.txt', 'r', encoding='utf-8', errors='ignore')
fileSecondary = open('Second.txt', 'r', encoding='utf-8', errors='ignore')
# Stripped lines of Second.txt, kept in a set for fast membership tests
bLines = set(thing.strip() for thing in fileSecondary)
for line in filePrimary:
    line = line.strip()
    if line not in bLines:
        fileOutPut.write(line + '\n')
fileOutPut.close()
filePrimary.close()
fileSecondary.close()
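But after running the script I ran into a problem: some lines are very similar without being identical. For example:
[PR] Zero One Two Three ft Four
and (no space after the bracket)
[PR]Zero One Two Three ft Four
or
[PR] Zero One Two Three ft Four
and (capital letter F)
[PR] Zero One Two Three Ft Four
I found SequenceMatcher, which does what I need, but how do I work it into the comparison, given that I am not comparing two strings but a string against a set?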
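IIUC, you want lines to match even when their whitespace or letter case differs.
A simple approach is to remove the whitespace and lowercase everything as you read it in: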
import re

def format_line(line):
    # Remove all whitespace and lowercase, so spacing and case differences are ignored
    return re.sub(r"\s+", "", line).lower()

# fileOutPut is the output file opened in the question's code
filePrimary = open('First.txt', 'r', encoding='utf-8', errors='ignore')
fileSecondary = open('Second.txt', 'r', encoding='utf-8', errors='ignore')
bLines = set(format_line(thing) for thing in fileSecondary)
for line in filePrimary:
    fline = format_line(line)
    if fline not in bLines:
        # line still ends with its own newline, so strip it before adding one
        fileOutPut.write(line.strip() + '\n')
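To check the normalization idea without touching any files, here is a minimal sketch that runs on in-memory lists. The helper name missing_lines and the sample data are assumptions made for illustration; they are not part of the original answer.

```python
import re

def format_line(line):
    # Drop all whitespace and lowercase, so spacing and case differences vanish
    return re.sub(r"\s+", "", line).lower()

def missing_lines(primary, secondary):
    # Hypothetical helper: lines of `primary` with no normalized match in `secondary`
    seen = {format_line(s) for s in secondary}
    return [p.strip() for p in primary if format_line(p) not in seen]

# Sample data modeled on the lines from the question
first = ["[PR] Zero One Two Three ft Four\n", "[PR] Five Six\n"]
second = ["[PR]Zero One Two Three Ft Four\n"]
print(missing_lines(first, second))  # ['[PR] Five Six']
```

The lookup stays a set membership test, so it scales the same way as the exact-match version in the question.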
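Update 1: fuzzy matching
If you want to match fuzzily, you can use something like nltk.metrics.distance.edit_distance
(docs),
but you won't be able to get around comparing each line against every other line (in the worst case). You lose the speed of the in
operation.
For example: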
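from nltk.metrics.distance import edit_distance as dist

threshold = 3  # lines fewer than this many edits apart count as a match
for line in filePrimary:
    fline = format_line(line)
    # A generator (not a list) lets any() stop at the first close-enough line
    match_found = any(dist(fline, other_line) < threshold for other_line in bLines)
    if not match_found:
        # line still ends with its own newline, so strip it before adding one
        fileOutPut.write(line.strip() + '\n')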
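Since the question mentions SequenceMatcher, the same string-against-set loop can also be sketched with the standard library's difflib instead of nltk. The helper name has_close_match and the 0.9 cutoff are assumptions for illustration; a suitable cutoff depends on your data.

```python
from difflib import SequenceMatcher

def has_close_match(line, candidates, cutoff=0.9):
    # Hypothetical helper: True if any candidate's similarity ratio reaches cutoff.
    # This is still a linear scan over candidates, like the edit-distance loop.
    return any(SequenceMatcher(None, line, other).ratio() >= cutoff
               for other in candidates)

# Sample data modeled on the lines from the question
b_lines = {"[PR]Zero One Two Three ft Four"}
print(has_close_match("[PR] Zero One Two Three Ft Four", b_lines))
print(has_close_match("[PR] Something Else Entirely", b_lines))
```

In the file loop, a line of First.txt would be written to Missing.txt only when has_close_match returns False for it.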