Forcing ndiff on very dissimilar strings

Question

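The ndiff function in difflib provides a nice interface for detecting differences between lines. When the lines are similar enough, it does a great job:

>>> from difflib import ndiff
>>> print('\n'.join(ndiff(['foo*'], ['foot'])))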
- foo*
?    ^

+ foot
?    ^

But when the lines are too dissimilar, the rich report (the hint lines starting with ?) is no longer produced:

>>> print('\n'.join(ndiff(['foo'], ['foo*****'])))
- foo
+ foo*****

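This is the use case I have run into, and I am looking for a way to force ndiff (or the underlying Differ class) to produce the rich report even when the strings are too dissimilar.

For the failing example, I would like to get a result like this:

>>> print('\n'.join(ndiff(['foo'], ['foo*****'])))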
- foo
+ foo*****
?    +++++

Answer 1

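It looks like what you want here is not a comparison across lines but a comparison across the characters of two strings. ndiff compares its arguments element by element, so you can pass your strings directly, without wrapping them in lists, and you should get behavior close to what you are looking for.

>>> from difflib import ndiff
>>> print('\n'.join(ndiff('foo', 'foo*****')))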
  f
  o
  o
+ *
+ *
+ *
+ *
+ *

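The output format is not exactly the one you are after, but it carries the right information. We can write an output adapter to produce the desired format.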

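def adapter(out):
    # Each ndiff entry looks like '<code> <char>': code at index 0, character at index 2.
    chars = []
    symbols = []

    for c in out:
        chars.append(c[2])
        symbols.append(c[0])

    # First string: every character seen; second: the matching ' ', '-' or '+' codes.
    return ''.join(chars), ''.join(symbols)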

It can be used like this:

>>> print('\n'.join(adapter(ndiff('foo', 'foo*****'))))
foo*****
   +++++
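
Building on the same character-level idea, here is a minimal sketch that renders the result in the three-line layout asked for in the question. The helper name char_diff_lines is hypothetical, and the sketch assumes that a character-level ndiff only yields entries starting with ' ', '-' or '+', which is what the runs above show. Unlike ndiff on lines, it marks every change with '-' or '+' and never uses '^'.

```python
from difflib import ndiff

def char_diff_lines(a, b):
    """Diff two strings character by character and format the result
    as '- a' and '+ b' lines, each followed by a '? ' hint line when needed."""
    a_tags, b_tags = [], []
    for entry in ndiff(a, b):
        code = entry[0]
        if code == ' ':      # character present in both strings
            a_tags.append(' ')
            b_tags.append(' ')
        elif code == '-':    # character only in a
            a_tags.append('-')
        elif code == '+':    # character only in b
            b_tags.append('+')

    lines = []
    for prefix, text, tags in (('- ', a, a_tags), ('+ ', b, b_tags)):
        lines.append(prefix + text)
        hint = ''.join(tags).rstrip()
        if hint:             # skip the hint line when nothing changed on this side
            lines.append('? ' + hint)
    return lines

print('\n'.join(char_diff_lines('foo', 'foo*****')))
```

For the failing example this prints the - foo and + foo***** lines followed by a hint line with five + signs under the asterisks.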

Answer 2

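The function responsible for printing the context (the lines starting with ?) is Differ._fancy_replace. It works by checking whether the two lines are at least 75% similar (see the cutoff variable). Unfortunately, the 75% cutoff is hard-coded and cannot be changed.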
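
To make the cutoff concrete, the following sketch computes the similarity ratio for the two pairs of lines from the question. It assumes the 0.75 threshold from the linked 3.6 source and uses a SequenceMatcher without a junk filter, which makes no difference for these strings; the helper name similarity is made up for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio in [0, 1], computed as 2 * matches / (len(a) + len(b))."""
    return SequenceMatcher(None, a, b).ratio()

close = similarity('foo*', 'foot')    # 2 * 3 / (4 + 4)
far = similarity('foo', 'foo*****')   # 2 * 3 / (3 + 8)

print(close, far)
```

The first pair reaches the threshold exactly, so it gets hint lines, while the second pair falls well below it and is reported as a plain removal and addition.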

What I can suggest is subclassing Differ and providing a version of _fancy_replace that simply ignores the cutoff. Here it is:

from difflib import Differ, SequenceMatcher

class FullContextDiffer(Differ):

    def _fancy_replace(self, a, alo, ahi, b, blo, bhi):
        """
        Copied and adapted from https://github.com/python/cpython/blob/3.6/Lib/difflib.py#L928
        """
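        best_ratio = 0
        best_i = best_j = None  # stay None if no pair of lines shares any content
        cruncher = SequenceMatcher(self.charjunk)

        # Find the most similar pair of lines, with no minimum-similarity cutoff.
        for j in range(blo, bhi):
            bj = b[j]
            cruncher.set_seq2(bj)
            for i in range(alo, ahi):
                ai = a[i]
                if ai == bj:
                    continue
                cruncher.set_seq1(ai)
                if cruncher.real_quick_ratio() > best_ratio and \
                      cruncher.quick_ratio() > best_ratio and \
                      cruncher.ratio() > best_ratio:
                    best_ratio, best_i, best_j = cruncher.ratio(), i, j

        if best_i is None:
            # Nothing to synch on: fall back to a plain '-' / '+' dump.
            yield from self._plain_replace(a, alo, ahi, b, blo, bhi)
            return

        yield from self._fancy_helper(a, alo, best_i, b, blo, best_j)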

        aelt, belt = a[best_i], b[best_j]

        atags = btags = ""
        cruncher.set_seqs(aelt, belt)
        for tag, ai1, ai2, bj1, bj2 in cruncher.get_opcodes():
            la, lb = ai2 - ai1, bj2 - bj1
            if tag == 'replace':
                atags += '^' * la
                btags += '^' * lb
            elif tag == 'delete':
                atags += '-' * la
            elif tag == 'insert':
                btags += '+' * lb
            elif tag == 'equal':
                atags += ' ' * la
                btags += ' ' * lb
            else:
                raise ValueError('unknown tag %r' % (tag,))
        yield from self._qformat(aelt, belt, atags, btags)

        yield from self._fancy_helper(a, best_i+1, ahi, b, best_j+1, bhi)

Here is an example of how it works:

a = [
    'foo',
    'bar',
    'foobar',
]

b = [
    'foo',
    'bar',
    'barfoo',
]

print('\n'.join(FullContextDiffer().compare(a, b)))

# Output:
# 
#   foo
#   bar
# - foobar
# ?    ---
# 
# + barfoo
# ? +++
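
To see where the --- and +++ markers in that output come from, this small sketch prints the opcodes that SequenceMatcher produces for the two differing lines. It uses no junk filter, matching a Differ created with default arguments.

```python
from difflib import SequenceMatcher

# Each opcode is (tag, a_start, a_end, b_start, b_end).
ops = SequenceMatcher(None, 'foobar', 'barfoo').get_opcodes()

for tag, a1, a2, b1, b2 in ops:
    print(tag, repr('foobar'[a1:a2]), repr('barfoo'[b1:b2]))
```

The matcher keeps 'foo' as the common block, so 'bar' is reported as inserted at the start of the second line and deleted from the end of the first, which is exactly what the hint lines mark.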
