Using difflib.diff_bytes to compare two files in Python

Suppose I want to compare file a and file b using the difflib.diff_bytes function. How would I do that?

Thanks

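In what follows I assume you have Python 3.x (specifically 3.5 or later, since difflib.diff_bytes was added in Python 3.5).
Let's go through the documentation to try to understand the function: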

difflib.diff_bytes(dfunc, a, b, fromfile=b'', tofile=b'', fromfiledate=b'', tofiledate=b'', n=3, lineterm=b'\n')
Compare a and b (lists of bytes objects) using dfunc; yield a sequence of delta lines (also bytes) in the format returned by dfunc. dfunc must be a callable, typically either unified_diff() or context_diff().

Allows you to compare data with unknown or inconsistent encoding. All inputs except n must be bytes objects, not str. Works by losslessly converting all inputs (except n) to str, and calling dfunc(a, b, fromfile, tofile, fromfiledate, tofiledate, n, lineterm). The output of dfunc is then converted back to bytes, so the delta lines that you receive have the same unknown/inconsistent encodings as a and b.
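
To make the "losslessly converting" part of the quoted documentation concrete, here is a minimal sketch of the round trip. It assumes the conversion is done with the ascii codec and the surrogateescape error handler, which is what the CPython implementation appears to use; the helper names to_str and to_bytes are hypothetical and only serve the illustration.

```python
def to_str(data: bytes) -> str:
    # Bytes >= 0x80 are mapped to lone surrogates (U+DC80..U+DCFF)
    # instead of raising UnicodeDecodeError.
    return data.decode('ascii', 'surrogateescape')


def to_bytes(text: str) -> bytes:
    # The surrogates are mapped back to the original byte values.
    return text.encode('ascii', 'surrogateescape')


latin1_line = b'andr\xe9'      # iso-8859-1 bytes
utf8_line = b'andr\xc3\xa9'    # utf-8 bytes

# Both lines survive the bytes -> str -> bytes round trip unchanged,
# even though neither of them is valid ASCII.
assert to_bytes(to_str(latin1_line)) == latin1_line
assert to_bytes(to_str(utf8_line)) == utf8_line
```

Because nothing is lost in either direction, the diff can be computed on str objects and still hand back exactly the bytes that were in the inputs.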

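The first thing to note is the difference between bytes objects and str (string) objects. Every input argument except n (and dfunc, which is a callable) must be a bytes object.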

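So the key is to pass bytes objects, not strings, to this function. If you have a string literal, put the b prefix in front of it; that produces an instance of the bytes type instead of the str type.
I suggest you read
What does the 'b' character do in front of a string literal?
string_literals
so I won't explain that part any further.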
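
As a quick illustration of the bytes-only requirement, the sketch below calls difflib.diff_bytes once with str lines and once with bytes lines. The helper diff_or_error and the sample data are hypothetical. Note that diff_bytes is a generator, so the TypeError only shows up once the result is iterated.

```python
import difflib


def diff_or_error(a, b):
    """Return the list of delta lines, or the TypeError that was raised."""
    try:
        # list() forces the generator to run, which is when inputs are checked.
        return list(difflib.diff_bytes(difflib.unified_diff, a, b))
    except TypeError as exc:
        return exc


# str lines are rejected ...
str_result = diff_or_error(['hello\n', 'world\n'], ['hello\n', 'there\n'])

# ... while the same content as bytes literals (note the b prefix) works.
bytes_result = diff_or_error([b'hello\n', b'world\n'], [b'hello\n', b'there\n'])
```

If the data starts out as str, calling .encode() with a known encoding gives bytes as well, although in that case you could also call unified_diff() or context_diff() directly on the strings.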
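Because I found the documentation for difflib.diff_bytes a bit cryptic, I decided to look directly at the code CPython itself uses to test the function.
This is a good exercise that helps in understanding how to use it.
The code that tests difflib.diff_bytes is located (assuming you are on Python 3.5) in
test_difflib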

Let's examine one example from that file to see what is going on.

def test_byte_content(self):
    # if we receive byte strings, we return byte strings
    a = [b'hello', b'andr\xe9']     # iso-8859-1 bytes
    b = [b'hello', b'andr\xc3\xa9'] # utf-8 bytes

    unified = difflib.unified_diff
    context = difflib.context_diff

    check = self.check
    check(difflib.diff_bytes(unified, a, a))
    check(difflib.diff_bytes(unified, a, b))

    # now with filenames (content and filenames are all bytes!)
    check(difflib.diff_bytes(unified, a, a, b'a', b'a'))
    check(difflib.diff_bytes(unified, a, b, b'a', b'b'))

    # and with filenames and dates
    check(difflib.diff_bytes(unified, a, a, b'a', b'a', b'2005', b'2013'))
    check(difflib.diff_bytes(unified, a, b, b'a', b'b', b'2005', b'2013'))

    # same all over again, with context diff
    check(difflib.diff_bytes(context, a, a))
    check(difflib.diff_bytes(context, a, b))
    check(difflib.diff_bytes(context, a, a, b'a', b'a'))
    check(difflib.diff_bytes(context, a, b, b'a', b'b'))
    check(difflib.diff_bytes(context, a, a, b'a', b'a', b'2005', b'2013'))
    check(difflib.diff_bytes(context, a, b, b'a', b'b', b'2005', b'2013'))

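As you can see, a and b are lists holding the contents of each file, one bytes object per line. The test then defines two variables, unified and context, which are used as the dfunc argument of the function. Note the "b" prefix on every literal as well. difflib.diff_bytes yields the delta lines as bytes objects. You then have to write your own function to check them.
One such example appears in another test in the same file, which also includes the file names in the diff: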

def test_byte_filenames(self):
    # somebody renamed a file from ISO-8859-2 to UTF-8
    fna = b'\xb3odz.txt'    # "łodz.txt"
    fnb = b'\xc5\x82odz.txt'

    # they transcoded the content at the same time
    a = [b'\xa3odz is a city in Poland.']
    b = [b'\xc5\x81odz is a city in Poland.']

    check = self.check
    unified = difflib.unified_diff
    context = difflib.context_diff
    check(difflib.diff_bytes(unified, a, b, fna, fnb))
    check(difflib.diff_bytes(context, a, b, fna, fnb))

    def assertDiff(expect, actual):
        # do not compare expect and equal as lists, because unittest
        # uses difflib to report difference between lists
        actual = list(actual)
        self.assertEqual(len(expect), len(actual))
        for e, a in zip(expect, actual):
            self.assertEqual(e, a)

    expect = [
        b'--- \xb3odz.txt',
        b'+++ \xc5\x82odz.txt',
        b'@@ -1 +1 @@',
        b'-\xa3odz is a city in Poland.',
        b'+\xc5\x81odz is a city in Poland.',
    ]
    actual = difflib.diff_bytes(unified, a, b, fna, fnb, lineterm=b'')
    assertDiff(expect, actual)

As you can now see, the file names are included in the delta lines as bytes objects.
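
Putting this together for the original question, here is a minimal sketch of comparing two files on disk. The function name diff_files and its parameters are hypothetical, and it assumes both files are small enough to read into memory. The files are opened in binary mode so that every line is a bytes object, and the paths are converted with os.fsencode because the file names must be bytes too.

```python
import difflib
import os


def diff_files(path_a, path_b, n=3):
    """Yield unified-diff lines (bytes) for two files read in binary mode."""
    # 'rb' gives bytes lines; readlines() keeps the trailing b'\n' on each
    # line, which matches the default lineterm of diff_bytes.
    with open(path_a, 'rb') as file_a, open(path_b, 'rb') as file_b:
        lines_a = file_a.readlines()
        lines_b = file_b.readlines()

    yield from difflib.diff_bytes(
        difflib.unified_diff,
        lines_a,
        lines_b,
        fromfile=os.fsencode(path_a),
        tofile=os.fsencode(path_b),
        n=n,
    )
```

Since the delta lines are bytes, they can be written to a binary stream as they are, for example with sys.stdout.buffer.writelines(diff_files('a', 'b')). Passing difflib.context_diff instead of difflib.unified_diff would give a context diff in the same way.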