How to extract non-matching text in two documents

Question

Suppose I have two strings.

a = 'I am Sam. I love cooking.'

b = 'I am sam. I used to drink a lot.'

I am computing their similarity score:

from difflib import SequenceMatcher
s = SequenceMatcher(lambda x: x == " ",a,b)
print(s.ratio())

Now I want to print the sentences that do not match in the two strings, like this:

a = 'I love cooking.'

b = 'I used to drink a lot.'

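Any suggestions on which module or method I could use to do this? I looked at the difflib module (https://pymotw.com/2/difflib/), but it marks its output with (+, -, !, ...) and I don't want output in that format.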

Answer 1

This is a very simple script, but I hope it gives you an idea of how to do it:

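a = 'I am Sam. I love cooking.'
b = 'I am sam. I used to drink a lot.'

# Split each string into sentences on the period.
a = a.split('.')
b = b.split('.')

# Only compare as many sentences as the shorter list has.
l = min(len(a), len(b))

c = 0
while c < l:
    # Compare case-insensitively; print b's sentence when the pair differs.
    if a[c].upper() != b[c].upper():
        print(b[c] + '.')
    c = c + 1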
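
The script above compares sentences by position and prints only the ones from b. As a rough sketch of a variation that reports the unmatched sentences from both strings, the following uses set membership instead of position. The function name unmatched_sentences is hypothetical, and it assumes that a sentence ends at a period and that case should be ignored, as in the script above.

```python
def unmatched_sentences(a, b):
    """Return (only_in_a, only_in_b), comparing sentences case-insensitively."""
    # Split on periods, trim whitespace, and drop empty fragments.
    sents_a = [s.strip() for s in a.split('.') if s.strip()]
    sents_b = [s.strip() for s in b.split('.') if s.strip()]

    # Upper-cased copies serve as case-insensitive lookup keys.
    keys_a = {s.upper() for s in sents_a}
    keys_b = {s.upper() for s in sents_b}

    # Keep the sentences whose key does not appear on the other side.
    only_in_a = [s + '.' for s in sents_a if s.upper() not in keys_b]
    only_in_b = [s + '.' for s in sents_b if s.upper() not in keys_a]
    return only_in_a, only_in_b


a = 'I am Sam. I love cooking.'
b = 'I am sam. I used to drink a lot.'
only_a, only_b = unmatched_sentences(a, b)
print('a =', ' '.join(only_a))
print('b =', ' '.join(only_b))
```

Because this ignores sentence order, a sentence that appears in both strings at different positions counts as a match, which may or may not be what you want.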

Answer 2

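Use difflib. You can easily post-process the output of difflib.Differ: strip the first two characters of each item and convert it to whatever format you want. Alternatively, you can use the alignment returned by SequenceMatcher.get_matching_blocks and generate your own output.

Here is how you could do it. If this is not what you want, edit your question to provide a less trivial comparison example and the output format you need.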

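import difflib

# Assumption: list1 and list2 hold the sentences of a and b from the question.
list1 = ['I am Sam.', 'I love cooking.']
list2 = ['I am sam.', 'I used to drink a lot.']

differ = difflib.Differ()
for line in differ.compare(list1, list2):
    if line.startswith("-"):
        print("a=" + line[2:])
    elif line.startswith("+"):
        print("b=" + line[2:])
    # else just ignore the line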
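
The second option mentioned above, walking the alignment from SequenceMatcher.get_matching_blocks, might look roughly like this. It is only a sketch: the helper name non_matching and the sample contents of list1 and list2 are made up for illustration, and lowercasing before comparison is an assumption based on the question treating 'Sam' and 'sam' as the same.

```python
from difflib import SequenceMatcher


def non_matching(list1, list2):
    """Collect the items of each list that fall outside every matching block."""
    # Assumption: compare case-insensitively by matching on lowercased copies.
    key1 = [s.lower() for s in list1]
    key2 = [s.lower() for s in list2]
    matcher = SequenceMatcher(None, key1, key2, autojunk=False)

    only1, only2 = [], []
    i = j = 0
    # The last block is a zero-length sentinel at (len(key1), len(key2)),
    # so the trailing unmatched items are picked up as well.
    for block in matcher.get_matching_blocks():
        only1.extend(list1[i:block.a])
        only2.extend(list2[j:block.b])
        i = block.a + block.size
        j = block.b + block.size
    return only1, only2


# Hypothetical input: the question's strings, already split into sentences.
list1 = ['I am Sam.', 'I love cooking.']
list2 = ['I am sam.', 'I used to drink a lot.']
only1, only2 = non_matching(list1, list2)
print("a=" + " ".join(only1))
print("b=" + " ".join(only2))
```

Unlike the Differ loop, this never produces marker prefixes, so there is nothing to strip afterwards.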

python

diff