Python Difflib 的 SequenceMatcher 没有找到最长公共子串

Python Difflib's SequenceMatcher does not find Longest Common Substrings

我想用difflib.SequenceMatcher从两个字符串中提取最长的公共子串。我不确定我是否发现了错误或误解了 find_longest_match 的文档。这是我感到困惑的一点:

In other words, of all maximal matching blocks, return one that starts earliest in a, and of all those maximal matching blocks that start earliest in a, return the one that starts earliest in b.

(https://docs.python.org/3.5/library/difflib.html#difflib.SequenceMatcher.find_longest_match)

比较字符串X this is a testthis is a test X,子字符串X实际上是一个maximal块:它不能被扩展(即,它是包含最大的)。此外,它是文本 A 中第一个这样的最大块。但它肯定不是最长公共子串。我强烈怀疑这不是 find_longest_match 应该找到的。

事实上,在这个例子中,find_longest_match确实找到了一个最长的公共子串:

>>> l = len("X this is a test")
>>> matcher = difflib.SequenceMatcher(None, "X this is a test", "this is a test X")
>>> matcher.find_longest_match(0, l, 0, l)
Match(a=2, b=0, size=14)

然而,对于其他一些字符串,我似乎可以激发上述 "find the first maximal block"- 行为(对于长字符串很抱歉,如果我缩短它们,示例会以某种方式中断):

>>> s1 = "e-like graph visualization using a spanning tree-driven layout technique with constraints specified by layers and the ordering of groups of nodes within layers. We propose a new method of how the orde"
>>> s2 = "itree graph visualization using a spanning tree-driven layout technique with constraints speci ed by layers and the ordering of groups of nodes within layers. We propose a new method of how the drivin"
>>> matcher = difflib.SequenceMatcher(None, s1, s2)
>>> matcher.find_longest_match(1, 149, 5, 149)
Match(a=1, b=47, size=1)

在这种情况下,它将 s1[1] 中的第一个 - 匹配到 s2[47] 中的 -,仅此而已。最长的公共子串可能是以 graph visualization using…

开头的

我是不是发现了错误,或者描述此行为的文档是否有另一种可能的解释?

我在 Ubuntu 上使用 Python 3.5.2。

好的,我明白了。如果有人遇到同样的问题:SequenceMatcher 有一个 autojunk 参数会做奇怪的事情:

The heuristic counts how many times each individual item appears in the sequence. If an item’s duplicates (after the first one) account for more than 1% of the sequence and the sequence is at least 200 items long, this item is marked as “popular” and is treated as junk for the purpose of sequence matching.

据我所知,匹配器永远不会找到任何包含 "junk" 的匹配项。不知道为什么这很有用,但默认情况下它是启用的。这也解释了为什么当我缩短字符串时上面的示例会中断。然而,它确实大大加快了 LCS 搜索速度。

所以,总而言之:您可能想在构造函数中传递 autojunk=False