Python difflib to compare two CSV files and highlight the word-level differences in HTML output

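I am not an expert in Python. I tried my best to find an answer but could not. Please excuse me if this is a duplicate question, and point me in the right direction if you can.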

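I am trying to compare two CSV files with Python's difflib and generate the diff output as an HTML page. The difflib example script has a built-in -m option that produces a side-by-side HTML view of the two CSV files with the differences highlighted.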

difflib uses difflib.SequenceMatcher to find the differences and difflib.HtmlDiff.make_file to create the HTML file. However, the output it produces is not what I want.
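For illustration, the character-level behaviour can be seen directly in difflib.ndiff, which HtmlDiff builds on. This is a minimal sketch using one row from the sample files further down; the variable names are only for this example.

```python
import difflib

old = ["1004,General Knowledge,Sinson,2007,465\n"]
new = ["1004,General Knowledge,Simpson,2007,465\n"]

# ndiff yields '- ', '+ ' and '? ' lines; the '? ' guide lines mark the
# individual characters that differ inside a changed line.
delta = list(difflib.ndiff(old, new))
print("".join(delta), end="")
```

The '? ' guide lines point at single characters inside "Sinson" and "Simpson", which is why the HTML output highlights only part of the word.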

The output I currently get from difflib is the default HTML output shown further down.

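What I want instead is word-level highlighting, rather than changes highlighted at the character or sequence level. If anything changes between the old and the new file, I want the WHOLE WORD highlighted.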

The change I want highlighted is: a word-level highlight of the text.
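One way to picture "word level" for CSV data is to compare lists of fields instead of strings of characters. The sketch below assumes a plain comma split is good enough (no quoted commas) and uses the same sample row; it is an illustration of the idea, not something difflib.HtmlDiff does by itself.

```python
import difflib

old_fields = "1004,General Knowledge,Sinson,2007,465".split(",")
new_fields = "1004,General Knowledge,Simpson,2007,465".split(",")

# SequenceMatcher works on any sequence of hashable items, so here each
# CSV field is one item and a change is always reported for a whole field.
matcher = difflib.SequenceMatcher(None, old_fields, new_fields, autojunk=False)
changes = [(tag, old_fields[i1:i2], new_fields[j1:j2])
           for tag, i1, i2, j1, j2 in matcher.get_opcodes()
           if tag != "equal"]
print(changes)  # [('replace', ['Sinson'], ['Simpson'])]
```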

Please help me with this: is it actually possible with difflib, or do I have to use some other tool or module? I tried vimdiff and other plugins but got nowhere. I am open to anything here.

The code I am using is from the Python difflib documentation page.

import sys, os, time, difflib, optparse

def main():
    # ..
    # .. (option parsing elided)
    # ..
    n = options.lines  # I used n = 0
    fromfile, tofile = args  # as specified in the usage string

    # we're passing these as arguments to the diff function
    fromdate = time.ctime(os.stat(fromfile).st_mtime)
    todate = time.ctime(os.stat(tofile).st_mtime)
    # the old 'U' open mode is no longer accepted by recent Python 3 releases
    with open(fromfile) as f:
        fromlines = f.readlines()
    with open(tofile) as f:
        tolines = f.readlines()

    diff = difflib.HtmlDiff().make_file(fromlines, tolines, fromfile,
                                        tofile, context=True,
                                        numlines=0)

    # make_file returns the whole page as one string, so write() is enough
    sys.stdout.write(diff)

old.csv

refno,title,author,year,price
1001,CPP,MILTON,2008,456
1002,JAVA,Gilson,2002,456
1003,Adobe Flex,2010,566
1004,General Knowledge,Sinson,2007,465
1005,Actionscript,Gilto,2008,480

new.csv

refno,title,author,year,price
1001,CPP,MILTON,2010,456,2008
1002,JAVA,Gilson,2002
1003,Adobe Flexi,Johnson,2010,566
1004,General Knowledge,Simpson,2007,465
105,Action script,Gilto,2008,480
2000,Drama,DayoNe,,2020,560

I am also adding the default HTML diff output and the expected HTML diff output below.

Default HTML DIFF Output from DIFFLIB:

<html>

<head>
<meta http-equiv="Content-Type"
content="text/html; charset=ISO-8859-1" />
<title></title>
<style type="text/css">
table.diff {font-family:Courier; border:medium;}
.diff_header {background-color:#e0e0e0}
td.diff_header {text-align:right}
.diff_next {background-color:#c0c0c0}
.diff_add {background-color:#aaffaa}
.diff_chg {background-color:#ffff77}
.diff_sub {background-color:#ffaaaa}
</style>
</head>

<body>

<table class="diff" id="difflib_chg_to0__top"
cellspacing="0" cellpadding="0" rules="groups" >
<colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup>
<colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup>
<thead><tr><th class="diff_next"><br /></th><th colspan="2" class="diff_header">old.csv</th><th class="diff_next"><br /></th><th colspan="2" class="diff_header">new.csv</th></tr></thead>
<tbody>
<tr><td class="diff_next" id="difflib_chg_to0__0"><a href="#difflib_chg_to0__top">t</a></td><td class="diff_header" id="from0_2">2</td><td nowrap="nowrap">1001,CPP,MILTON,200<span class="diff_sub">8</span>,456</td><td class="diff_next"><a href="#difflib_chg_to0__top">t</a></td><td class="diff_header" id="to0_2">2</td><td nowrap="nowrap">1001,CPP,MILTON,20<span class="diff_add">1</span>0,456<span class="diff_add">,2008</span></td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_3">3</td><td nowrap="nowrap">1002,JAVA,Gilson,2002<span class="diff_sub">,456</span></td><td class="diff_next"></td><td class="diff_header" id="to0_3">3</td><td nowrap="nowrap">1002,JAVA,Gilson,2002</td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_4">4</td><td nowrap="nowrap">1003,Adobe&nbsp;Flex,2010,566</td><td class="diff_next"></td><td class="diff_header" id="to0_4">4</td><td nowrap="nowrap">1003,Adobe&nbsp;Flex<span class="diff_add">i,Johnson</span>,2010,566</td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_5">5</td><td nowrap="nowrap">1004,General&nbsp;Knowledge,Si<span class="diff_chg">n</span>son,2007,465</td><td class="diff_next"></td><td class="diff_header" id="to0_5">5</td><td nowrap="nowrap">1004,General&nbsp;Knowledge,Si<span class="diff_chg">mp</span>son,2007,465</td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_6">6</td><td nowrap="nowrap">1<span class="diff_sub">0</span>05,Actionscript,Gilto,2008,480</td><td class="diff_next"></td><td class="diff_header" id="to0_6">6</td><td nowrap="nowrap">105,Action<span class="diff_add">&nbsp;</span>script,Gilto,2008,480</td></tr>
<tr><td class="diff_next"></td><td class="diff_header"></td><td nowrap="nowrap"></td><td class="diff_next"></td><td class="diff_header" id="to0_7">7</td><td nowrap="nowrap"><span class="diff_add">2000,Drama,DayoNe,,2020,560</span></td></tr>
</tbody>
</table>

</body>

</html>

Expected HTML DIFF Output from DIFFLIB:

<html>

<head>
<meta http-equiv="Content-Type"
content="text/html; charset=ISO-8859-1" />
<title></title>
<style type="text/css">
table.diff {font-family:Courier; border:medium;}
.diff_header {background-color:#e0e0e0}
td.diff_header {text-align:right}
.diff_next {background-color:#c0c0c0}
.diff_add {background-color:#aaffaa}
.diff_chg {background-color:#ffff77}
.diff_sub {background-color:#ffaaaa}
</style>
</head>

<body>

<table class="diff" id="difflib_chg_to0__top"
cellspacing="0" cellpadding="0" rules="groups" >
<colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup>
<colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup>
<thead><tr><th class="diff_next"><br /></th><th colspan="2" class="diff_header">old.csv</th><th class="diff_next"><br /></th><th colspan="2" class="diff_header">new.csv</th></tr></thead>
<tbody>
<tr><td class="diff_next" id="difflib_chg_to0__0"><a href="#difflib_chg_to0__top">t</a></td><td class="diff_header" id="from0_2">2</td><td nowrap="nowrap">1001,CPP,MILTON,<span class="diff_sub">2008</span>,456</td><td class="diff_next"><a href="#difflib_chg_to0__top">t</a></td><td class="diff_header" id="to0_2">2</td><td nowrap="nowrap">1001,CPP,MILTON,<span class="diff_add">2010</span>,456<span class="diff_add">,2008</span></td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_3">3</td><td nowrap="nowrap">1002,JAVA,Gilson,2002<span class="diff_sub">,456</span></td><td class="diff_next"></td><td class="diff_header" id="to0_3">3</td><td nowrap="nowrap">1002,JAVA,Gilson,2002</td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_4">4</td><td nowrap="nowrap">1003,<span class="diff_sub">Adobe&nbsp;Flex</span>,2010,566</td><td class="diff_next"></td><td class="diff_header" id="to0_4">4</td><td nowrap="nowrap">1003,<span class="diff_add">Adobe&nbsp;Flexi</span>,<span class="diff_add">Johnson</span>,2010,566</td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_5">5</td><td nowrap="nowrap">1004,General&nbsp;Knowledge,<span class="diff_sub">Sinson</span>,2007,465</td><td class="diff_next"></td><td class="diff_header" id="to0_5">5</td><td nowrap="nowrap">1004,General&nbsp;Knowledge,<span class="diff_add">Simpson</span>,2007,465</td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_6">6</td><td nowrap="nowrap"><span class="diff_sub">1005</span>,<span class="diff_sub">Actionscript</span>,Gilto,2008,480</td><td class="diff_next"></td><td class="diff_header" id="to0_6">6</td><td nowrap="nowrap"><span class="diff_add">105</span>,<span class="diff_add">Action&nbsp;script</span>,Gilto,2008,480</td></tr>
<tr><td class="diff_next"></td><td class="diff_header"></td><td nowrap="nowrap"></td><td class="diff_next"></td><td class="diff_header" id="to0_7">7</td><td nowrap="nowrap"><span class="diff_add">2000,Drama,DayoNe,,2020,560</span></td></tr>
</tbody>
</table>

</body>

</html>

Question: I am looking for a word level highlight

Implement a class Comma_HtmlDiff that expands the highlighting to comma boundaries:
You have to override difflib.ndiff.

Note: Only expanding the first highlighted part of a line is implemented.
If difflib.ndiff highlights across a comma, this is not corrected.

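class Comma_HtmlDiff(difflib.HtmlDiff):
    def __init__(self, tabsize=8, wrapcolumn=None, linejunk=None,
                 charjunk=difflib.IS_CHARACTER_JUNK):
        # Monkey-patch difflib.ndiff so HtmlDiff calls the comma-aware version.
        # Save the original only once; otherwise a second instance would store
        # the already patched function and recurse endlessly.
        if not hasattr(difflib, '_ndiff'):
            setattr(difflib, '_ndiff', difflib.ndiff)
        setattr(difflib, 'ndiff', self.ndiff)
        super().__init__(tabsize, wrapcolumn, linejunk, charjunk)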

    def ndiff(self, a, b, linejunk=None, charjunk=difflib.IS_CHARACTER_JUNK):
        _line = ''
        for line in difflib._ndiff(a, b, linejunk, charjunk):
            if line.startswith('-'):
                _d = '-'
                _line = line
            elif line.startswith('+'):
                _d = '+'
                _line = line

            if line.startswith('?'):
                dp = line.find(_d)
                if dp == -1:
                    _d = '+'
                    dp = line.find('^')
                dpl = _line.rfind(',', 0, dp)
                if dpl == -1:
                    dpl = 2
                else:
                    dpl += 1
                dpr = _line.find(',', dp)
                if dpr == dp:
                    _d = ' '
                    dpl = dp
                    dpr = dp+1

                dpw = dpr - dpl
                line = line[:dpl] + _d*dpw + line[dpr:]

            yield line

# Usage
diff = Comma_HtmlDiff().make_file(fromlines, tolines, fromfile,
                                    tofile, context=True,
                                    numlines=0)
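
As an alternative that avoids patching difflib, the field-level comparison can be sketched directly with SequenceMatcher over split fields. This is not the approach of the class above: highlight_fields is a hypothetical helper, it assumes the two lines have already been paired up, that a plain comma split is sufficient (no quoted commas), and it only reuses the diff_sub and diff_add CSS class names from the HtmlDiff stylesheet.

```python
import difflib
import html


def highlight_fields(old_line, new_line, sep=","):
    """Return (old_html, new_html) with every changed field wrapped in a span.

    Hypothetical helper for illustration; it handles one pair of lines only.
    """
    old_fields = old_line.rstrip("\n").split(sep)
    new_fields = new_line.rstrip("\n").split(sep)
    matcher = difflib.SequenceMatcher(None, old_fields, new_fields,
                                      autojunk=False)

    old_out, new_out = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            old_out += [html.escape(f) for f in old_fields[i1:i2]]
            new_out += [html.escape(f) for f in new_fields[j1:j2]]
        else:
            # 'replace', 'delete' and 'insert' all mark whole fields
            old_out += ['<span class="diff_sub">%s</span>' % html.escape(f)
                        for f in old_fields[i1:i2]]
            new_out += ['<span class="diff_add">%s</span>' % html.escape(f)
                        for f in new_fields[j1:j2]]
    return sep.join(old_out), sep.join(new_out)


old_html, new_html = highlight_fields("1005,Actionscript,Gilto,2008,480",
                                      "105,Action script,Gilto,2008,480")
print(old_html)
print(new_html)
```

For this row the first two fields are wrapped as whole words on both sides, which is the shape of the expected output in the question. Building the surrounding table and pairing up the lines would still have to be done separately.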

Tested with Python: 3.4.2