Diff Python files, ignoring line ending styles, indentation styles and trailing spaces

Question

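TL;DR: Get 'functionally' diff of two python files

I'm writing a plugin framework that will run on Unix, Mac and Windows. At one point I need to check whether files in two folders are functionally the same Python code, so that I can remove redundancy. I know that "does running file a give the same result as running file b" is a question that is both tricky and silly. What I want is to check whether file a and file b contain the same code, while ignoring: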

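different line ending styles (Windows "\r\n" vs Unix "\n" vs Mac "\r")
trailing whitespace
different indentation styles (tabs vs spaces, 2 spaces vs 4 spaces, etc.)

And if possible:

differing blank lines
inner indentation (e.g. indentation inside multi-line lists)
funky indentation (mixed indentation styles, e.g. 2 spaces on the first level and 6 spaces on the second)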

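I would prefer it if a diff-representation would be returned, but a "no match" message and the line number of the first mismatch would be enough. If external utilities are used, they need to be standard on their respective systems, or free, lightweight and portable (so that I can include them in my portable framework).

Currently I'm running this in Python 3: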

##  test if two files are the same (apart from line endings)
def cmp_lines(path_1, path_2, skip_blanklines=True, skip_trailing=True, skip_leading=False, spaces_per_tab=4, comp_indent=False):
    l1 = l2 = True
    # text mode already applies universal newlines; the 'rU' mode was removed in Python 3.11
    with open(path_1, 'r') as f1, open(path_2, 'r') as f2:
        ind1, ind2 = [0],[0]
        while l1 and l2:
            l1 = f1.readline()
            l2 = f2.readline()
            # TODO rework: strip trailing whitespace.
            if skip_trailing: l1, l2 = l1.rstrip(), l2.rstrip()
            # test indentation (this also strips leading whitespace)
            #-  here, sub-indentation, e.g. in multi-line lists,
            #-  is treated as normal indentation
            if comp_indent:
                l1b, l2b = l1.lstrip(), l2.lstrip()
                i1, i2 = l1[:len(l1)-len(l1b)], l2[:len(l2)-len(l2b)]
                ind1b = len(i1)*[1, spaces_per_tab][i1=="\t"*len(i1)]
                ind2b = len(i2)*[1, spaces_per_tab][i2=="\t"*len(i2)]
                while ind1b < ind1[-1]: ind1.pop()
                while ind2b < ind2[-1]: ind2.pop()
                if ind1[-1]<ind1b: ind1.append(ind1b)
                if ind2[-1]<ind2b: ind2.append(ind2b)
                if len(ind1)!=len(ind2): print("indentation mismatch")
                l1, l2 = l1b, l2b
            # TODO rework: strip leading whitespace.
            elif skip_leading: l1, l2 = l1.lstrip(), l2.lstrip()
            if l1 != l2:
                #print('a',l1,'-a',l1=='',l1=='\n',l1=='\r\n',l1=='\r')
                #print('b',l2,'-b',l2=='',l2=='\n',l2=='\r\n',l2=='\r')
                if skip_blanklines: # TODO rework: can only skip a single blank line so far
                    if l1 == '\n':
                        l1b=f1.readline()
                        if l1b==l2: continue
                    if l2 == '\n':
                        l2b=f2.readline()
                        if l2b==l1: continue
                return False
    return True

These two snippets should be considered equal (\t stands for a tab, \r for CR, \n for LF):

if True:  \r\n
    \n
    print('HI')\r

if True:\n
\tprint('HI')\n

Answer 1

Get 'functionally' diff of two python files

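As far as I know, there is no way to do this other than parsing the Python code. The reason is that whitespace sometimes matters and sometimes doesn't. It depends on the context, and you can't know the context without parsing.

For example, there is a 'functional' difference between: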

a_string = """foo
bar"""

and

a_string = """foo
     bar"""

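even though it is only an indentation difference.

You shouldn't write the parser yourself. You may want to embed an already existing Python parser. But that may be a lot of work.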
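One existing parser that needs no embedding is the standard library's ast module. The following is only a minimal sketch, not part of the original answer: the function name same_code is hypothetical, and it assumes both files are valid Python for the running interpreter. Note that it ignores more than whitespace (comments, redundant parentheses and string quote style also disappear in the syntax tree), and it only answers yes or no without a line number.

```python
import ast


def same_code(source_a, source_b):
    """Return True if both sources parse to the same syntax tree.

    Hypothetical helper for illustration; raises SyntaxError if either
    source is not valid Python.
    """
    def tree(source):
        # Normalise Windows and old Mac line endings before parsing.
        source = source.replace("\r\n", "\n").replace("\r", "\n")
        # ast.dump() omits line and column numbers by default, so layout
        # differences do not show up in the dumped string.
        return ast.dump(ast.parse(source))

    return tree(source_a) == tree(source_b)


# The two snippets from the question compare equal ...
print(same_code("if True:  \r\n    \n    print('HI')\r",
                "if True:\n\tprint('HI')\n"))            # True
# ... while indentation inside a string literal still counts.
print(same_code('a_string = """foo\nbar"""\n',
                'a_string = """foo\n     bar"""\n'))     # False
```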

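If parsing isn't for you, you may want to use a degraded comparison that doesn't care about whitespace at all (something like diff -w). Here is my attempt: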

from itertools import zip_longest

class no_space_file_reader :
    def __init__(self, filepath):
        self.file = open(filepath)

    def all_chars(self):
        # yield every character of the file except whitespace
        for line in self.file:
            for char in line:
                if char not in " \t\r\n":
                    yield char

a = no_space_file_reader("a.txt")
b = no_space_file_reader("b.txt")
# zip_longest, so that extra content at the end of one file is reported too
for c_a,c_b in zip_longest(a.all_chars(), b.all_chars()):
    if c_a != c_b:
       print("diff")
       break

Of course, it sees no difference between

a_string = m("")

and

a_string = m(" ")

which is not cool. But that is the price of not parsing.
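A possible middle ground between full parsing and dropping all whitespace is to compare token streams with the standard library's tokenize module. The sketch below is an assumption-laden illustration rather than part of the original attempt: the names significant_tokens and first_mismatch are made up, both inputs must tokenize as Python, and comments are compared like any other token. Whitespace inside string literals is kept, so m("") and m(" ") are told apart, and the first mismatching line numbers are returned.

```python
import io
import tokenize

# Tokens whose text depends only on layout (indent width, newline style).
LAYOUT = {tokenize.INDENT, tokenize.DEDENT, tokenize.NEWLINE}


def significant_tokens(source):
    """Yield (type, text, line number) for every token that carries meaning."""
    source = source.replace("\r\n", "\n").replace("\r", "\n")
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NL:
            continue  # blank lines and line breaks inside brackets
        text = "" if tok.type in LAYOUT else tok.string
        yield tok.type, text, tok.start[0]


def first_mismatch(source_a, source_b):
    """Return (line in a, line in b) of the first differing token, or None."""
    # Each stream ends with exactly one ENDMARKER token, so zip() cannot hide
    # a difference in length: the shorter stream's ENDMARKER will mismatch.
    pairs = zip(significant_tokens(source_a), significant_tokens(source_b))
    for (type_a, text_a, line_a), (type_b, text_b, line_b) in pairs:
        if (type_a, text_a) != (type_b, text_b):
            return line_a, line_b
    return None


print(first_mismatch("if True:  \r\n    \n    print('HI')\r",
                     "if True:\n\tprint('HI')\n"))   # None
print(first_mismatch('a_string = m("")\n',
                     'a_string = m(" ")\n'))         # (1, 1)
```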

I would prefer it if a diff-representation would be returned

That is tricky too, but at least feasible. Here is my full attempt:

from collections import OrderedDict

class no_space_file_reader :
    def __init__(self, filepath):
        self.file = open(filepath)
        self.context_size = 2
        self.context = OrderedDict()

    def all_chars(self):
        last_was_space = False
        for lineid,line in enumerate(self.file.readlines()):
            self.context[lineid] = line.strip()
            if len(self.context) > self.context_size:
                self.context.popitem(last=False)
            for char in line:
                if char in " \t\r\n":
                    if last_was_space :
                        continue
                    else:
                        last_was_space = True
                else :
                    last_was_space = False
                    yield char

class diff_agglomerator :
    def __init__(self):
        self.diff = [{},{}]
        self.context_size = 2

    def append(self, contexts):
        self.diff[0].update(contexts[0])
        self.diff[1].update(contexts[1])

    def pop_and_format_diff_if_ended(self, current_lines):
        if self.is_empty():
            return ""
        last_lines = [max(self.diff[0].keys()), max(self.diff[1].keys())]
        toReturn =""
        if last_lines[0] < current_lines[0] - self.context_size and \
           last_lines[1] < current_lines[1] - self.context_size:
              toReturn = self.pop_and_format_diff()
        return toReturn

    def format_line(self, a_dict):
        return "\n".join(["{} :{}".format(k,v) for k,v in a_dict])

    def pop_and_format_diff(self):
        toReturn = ">"*5 + "\n"
        toReturn += self.format_line(self.diff[0].items()) + "\n"
        toReturn += "="*5 + "\n"
        toReturn += self.format_line(self.diff[1].items()) + "\n"
        toReturn += "<"*5 + "\n"
        self.diff = [{},{}]
        return toReturn

    def is_empty(self):
        return len(self.diff[0]) == 0 and len (self.diff[1]) == 0

def print_if_non_empty(a_string):
    if len(a_string)>0:
        print(a_string)
        
a = no_space_file_reader("a.txt")
b = no_space_file_reader("b.txt")

diff = diff_agglomerator()
for c_a,c_b in zip(a.all_chars(), b.all_chars()):
    if c_a != c_b:
        diff.append([a.context, b.context])
    else:
        first_context_lines = [min(a.context.keys()), min(b.context.keys())]
        print_if_non_empty(diff.pop_and_format_diff_if_ended(first_context_lines))
# flush the last pending diff, but do not print an empty block for identical files
if not diff.is_empty():
    print(diff.pop_and_format_diff())

It produces this kind of output:

>>>>>
4 :
5 :foo
6 :bar
=====
3 :
4 :bar  foo
<<<<<

>>>>>
18 :
19 :foo
20 :bar
=====
16 :
17 :bar foo
<<<<<
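If a conventional line-based diff is preferred, another option is to normalise both files first and then hand the result to the standard library's difflib. The following is a rough sketch under stated assumptions, not the method used above: normalize is a hypothetical helper, indentation is rewritten to four spaces per nesting level with the same stack idea as in the question, and blank lines are dropped. As with any approach that does not parse, lines inside multi-line string literals are normalised too, and the line numbers in the diff refer to the normalised text rather than the original files.

```python
import difflib


def normalize(source, tab_width=4):
    """Return the lines of source with layout differences removed."""
    levels = [0]   # stack of indentation widths seen so far
    result = []
    text = source.replace("\r\n", "\n").replace("\r", "\n")
    for line in text.split("\n"):
        stripped = line.strip()
        if not stripped:
            continue  # drop blank and whitespace-only lines
        expanded = line.expandtabs(tab_width)
        width = len(expanded) - len(expanded.lstrip())
        while width < levels[-1]:
            levels.pop()
        if width > levels[-1]:
            levels.append(width)
        # Re-emit the line with one fixed indent per nesting level.
        result.append("    " * (len(levels) - 1) + stripped)
    return result


a = normalize("if True:  \r\n    \n    print('HI')\r")
b = normalize("if True:\n\tprint('HI')\n")
# unified_diff yields nothing when the normalised files are identical
for diff_line in difflib.unified_diff(a, b, "a.py", "b.py", lineterm=""):
    print(diff_line)
```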

Tags: python, diff