Python difflib gives bad results

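I'm using Python's difflib to compute the difference between two plain-text English paragraphs.

The paragraphs are very similar: one of them has an additional opening sentence and closing sentence. There are also slight differences in individual characters.

Unfortunately, I'm getting poor results. It seems that a single character near the start of the diff throws it off, and random characters end up scattered throughout the output.

Websites such as diffchecker.com have no problem computing the diff. I also noticed that if I narrow the window I give difflib so that the first sentence is ignored, it computes the diff correctly. Has anyone else noticed this issue?

My code and the example paragraphs are below. Thanks very much.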

import difflib

s1 = "Ableton Live also supports Audio To MIDI, which converts audio samples into a sequence of MIDI notes using three different conversion methods including conversion to Melody, Harmony, or Rhythm. Once finished, Live will create a new MIDI track containing the fresh MIDI notes along with an instrument to play back the notes. Audio to midi conversion is not always 100% accurate and may require the artist or producer to manually adjust some notes.[14] See Fourier transform.Envelopes[edit]Almost all of the parameters in Live can be automated by envelopes which may be drawn either on clips, in which case they will be used in every performance of that clip, or on the entire arrangement. The most obvious examples are volume and track panning, but envelopes are also used in Live to control parameters of audio devices such as the root note of a resonator or a filter’s cutoff frequency. Clip envelopes may also be mapped to MIDI controls, which can also control parameters in real-time using sliders, faders and such. Using the global transport record function will also record changes made to these parameters, creating an envelope for them.User interface[edit]Much of Live’s interface comes from being designed for use in live performance, as well as for production.[15] There are few pop up messages or dialogs. Portions of the interface are hidden and shown based on arrows which may be clicked to show or hide a certain segment (e.g. to hide the instrument/effect list or to show or hide the help box)."
s2 = "Once finished, Live will create a new MIDI track containing the fresh MIDI notes along with an instrument to play back the notes. Audio to midi conversion is not always 100% accurate and may require the artist or producer to manually adjust some notes. [14] See Fourier transform . Envelopes[ edit ] Almost all of the parameters in Live can be automated by envelopes which may be drawn either on clips, in which case they will be used in every performance of that clip, or on the entire arrangement. The most obvious examples are volume and track panning, but envelopes are also used in Live to control parameters of audio devices such as the root note of a resonator or a filter’s cutoff frequency. Clip envelopes may also be mapped to MIDI controls, which can also control parameters in real-time using sliders, faders and such. Using the global transport record function will also record changes made to these parameters, creating an envelope for them. User interface[ edit ] Much of Live’s interface comes from being designed for use in live performance, as well as for production."

if __name__ == "__main__":
    res = [d for d in difflib.ndiff(s1, s2)]
    print(res)

As the docs say,

Compare a and b (lists of strings) ... return a Differ-style delta (a generator generating the delta lines).

ndiff() is for comparing two files, given the lists of lines the files contain. It works much like the common Unixy diff utilities.
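To see what that means for the call in the question, here is a minimal sketch with two short made-up strings (they stand in for the real paragraphs). A str is itself a sequence of one-character strings, so ndiff() ends up treating every character as a "line":

```python
import difflib

# Two short made-up strings, standing in for the question's paragraphs.
a = "cat"
b = "cart"

# ndiff() iterates over its arguments. Iterating a str yields single
# characters, so each character is compared as if it were a whole line.
delta = list(difflib.ndiff(a, b))
for entry in delta:
    print(repr(entry))
```

Each entry is a two-character prefix followed by one character, which is why the result for a 1,500-character paragraph looks like scattered single letters.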

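You are trying to compare two single lines. difflib has no built-in "pretty printing" way to do that, but it does supply the comparison machinery on which you can build any formatting you like. For example,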

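# autojunk=False turns off the "popular element" heuristic, which only
# applies to sequences of 200 or more items (both strings here are longer).
d = difflib.SequenceMatcher(None, s1, s2, autojunk=False)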
for op in d.get_opcodes():
    print(op)

prints

('delete', 0, 194, 0, 0)
('equal', 194, 446, 0, 252)
('insert', 446, 446, 252, 253)
('equal', 446, 472, 253, 279)
('insert', 472, 472, 279, 280)
('equal', 472, 473, 280, 281)
('insert', 473, 473, 281, 282)
('equal', 473, 483, 282, 292)
('insert', 483, 483, 292, 293)
('equal', 483, 487, 293, 297)
('insert', 487, 487, 297, 298)
('equal', 487, 488, 298, 299)
('insert', 488, 488, 299, 300)
('equal', 488, 1143, 300, 955)
('insert', 1143, 1143, 955, 956)
('equal', 1143, 1158, 956, 971)
('insert', 1158, 1158, 971, 972)
('equal', 1158, 1162, 972, 976)
('insert', 1162, 1162, 976, 977)
('equal', 1162, 1163, 977, 978)
('insert', 1163, 1163, 978, 979)
('equal', 1163, 1269, 979, 1085)
('delete', 1269, 1508, 1085, 1085)

See the docs for exactly what those mean. They succinctly describe what is needed to change s1 into s2. A long exact-match block is described by ('equal', 488, 1143, 300, 955), and, indeed,

>>> s1[488 : 1143] == s2[300 : 955]
True
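As a sanity check on how to read the opcodes, the sketch below rebuilds the second string from the first by following them. The helper name apply_opcodes and the short inputs are made up for illustration; they are not part of difflib:

```python
import difflib

def apply_opcodes(a, b, opcodes):
    """Rebuild b from a by following the opcodes (hypothetical helper)."""
    out = []
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == "equal":
            out.append(a[i1:i2])   # unchanged text, copied from a
        elif tag in ("insert", "replace"):
            out.append(b[j1:j2])   # new text, taken from b
        # "delete" contributes nothing: a[i1:i2] is simply dropped
    return "".join(out)

# Short made-up inputs; the real s1 and s2 work the same way.
a = "Intro. notes.[14] See"
b = "notes. [14] See also"
ops = difflib.SequenceMatcher(None, a, b, autojunk=False).get_opcodes()
print(apply_opcodes(a, b, ops) == b)
```

Only the 'equal' ranges index into the first string; 'insert' and 'replace' take their text from the second.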

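Suggestion: instead, split your two inputs into sentences, and treat each input as a sequence (such as a list) of newline-terminated sentences. Then you can use ndiff() directly, in the way it was intended to be used.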
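A minimal sketch of that suggestion follows. The splitter is a naive assumption (break after '.', '!' or '?' when followed by whitespace), and the inputs are made up. Text like the question's, where some sentences run together with no space, would need a better splitter or some cleanup first:

```python
import difflib
import re

def split_sentences(text):
    """Naive splitter, an assumption for this sketch: break after '.', '!'
    or '?' when followed by whitespace, and end each piece with a newline."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part + "\n" for part in parts if part]

# Short made-up inputs; substitute the real s1 and s2.
old = "Intro sentence. Shared first point. Shared second point. Closing note."
new = "Shared first point. Shared second point."

# Now ndiff() receives lists of "lines", which is what it expects.
delta = list(difflib.ndiff(split_sentences(old), split_sentences(new)))
print("".join(delta), end="")
```

With sentence-sized lines, the extra opening and closing sentences show up as whole '- ' entries instead of a stream of single characters.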

To make the other approach more concrete, this code, for example:

import difflib
d = difflib.SequenceMatcher(None, s1, s2, autojunk=False)
for op, i1, i2, j1, j2 in d.get_opcodes():
    print(">>> ", end="")
    if op == "equal":
        print(f"{i2-i1} characters the same at",
              f"{i1}:{i2} and {j1}:{j2}")
        print(s1[i1:i2])
    elif op == "delete":
        print(f"delete {i2-i1} characters at {i1}:{i2}")
        print(s1[i1:i2])
    elif op == "insert":
        print(f"insert {j2-j1} characters from {j1}:{j2}")
        print(s2[j1:j2])
    elif op == "replace":
        print(f"replace {i1}:{i2} with {j1}:{j2}")
        print(s1[i1:i2])
        print(s2[j1:j2])
    else:
        assert False, ("unknown op", repr(op))

produces this output:

>>> delete 194 characters at 0:194
Ableton Live also supports Audio To MIDI, which converts audio samples into a sequence of MIDI notes using three different conversion methods including conversion to Melody, Harmony, or Rhythm. 
>>> 252 characters the same at 194:446 and 0:252
Once finished, Live will create a new MIDI track containing the fresh MIDI notes along with an instrument to play back the notes. Audio to midi conversion is not always 100% accurate and may require the artist or producer to manually adjust some notes.
>>> insert 1 characters from 252:253
 
>>> 26 characters the same at 446:472 and 253:279
[14] See Fourier transform
>>> insert 1 characters from 279:280
 
>>> 1 characters the same at 472:473 and 280:281
.
>>> insert 1 characters from 281:282
 
>>> 10 characters the same at 473:483 and 282:292
Envelopes[
>>> insert 1 characters from 292:293
 
>>> 4 characters the same at 483:487 and 293:297
edit
>>> insert 1 characters from 297:298
 
>>> 1 characters the same at 487:488 and 298:299
]
>>> insert 1 characters from 299:300
 
>>> 655 characters the same at 488:1143 and 300:955
Almost all of the parameters in Live can be automated by envelopes which may be drawn either on clips, in which case they will be used in every performance of that clip, or on the entire arrangement. The most obvious examples are volume and track panning, but envelopes are also used in Live to control parameters of audio devices such as the root note of a resonator or a filter’s cutoff frequency. Clip envelopes may also be mapped to MIDI controls, which can also control parameters in real-time using sliders, faders and such. Using the global transport record function will also record changes made to these parameters, creating an envelope for them.
>>> insert 1 characters from 955:956
 
>>> 15 characters the same at 1143:1158 and 956:971
User interface[
>>> insert 1 characters from 971:972
 
>>> 4 characters the same at 1158:1162 and 972:976
edit
>>> insert 1 characters from 976:977
 
>>> 1 characters the same at 1162:1163 and 977:978
]
>>> insert 1 characters from 978:979
 
>>> 106 characters the same at 1163:1269 and 979:1085
Much of Live’s interface comes from being designed for use in live performance, as well as for production.
>>> delete 239 characters at 1269:1508
[15] There are few pop up messages or dialogs. Portions of the interface are hidden and shown based on arrows which may be clicked to show or hide a certain segment (e.g. to hide the instrument/effect list or to show or hide the help box).

You can edit that template to display the results any way you like.
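As one possible variation on the template, the sketch below folds the opcodes into a single marked-up string. The function name inline_diff and the [-...-] / {+...+} markers are arbitrary choices for illustration, not something difflib provides:

```python
import difflib

def inline_diff(a, b):
    """Merge a and b into one string, marking text only in a as [-...-]
    and text only in b as {+...+}. The marker style is an arbitrary choice."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    pieces = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            pieces.append(a[i1:i2])
        if op in ("delete", "replace"):
            pieces.append(f"[-{a[i1:i2]}-]")
        if op in ("insert", "replace"):
            pieces.append(f"{{+{b[j1:j2]}+}}")
    return "".join(pieces)

# A fragment of the question's text, before and after the extra spaces.
print(inline_diff("See Fourier transform.Envelopes[edit]",
                  "See Fourier transform . Envelopes[ edit ]"))
```

Because a 'replace' emits the deleted text followed by the inserted text, the result can always be read back in either direction.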