Using Python difflib to compare two CSV files and highlight word-level differences in the HTML output
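I am not an expert in Python. I tried my best to find an answer but could not. Please excuse me if this is a duplicate question, and if so, point me in the right direction.
I am trying to compare two CSV files with Python's difflib and generate the diff output as an HTML page. The diff.py example script from the difflib documentation has a built-in -m option that produces a side-by-side HTML view of the two CSV files with the differences highlighted.
However, difflib uses difflib.SequenceMatcher to find the differences and difflib.HtmlDiff.make_file to create the HTML file, and the output it produces is not what I want.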
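As a rough illustration of why the default highlighting lands on single characters, the sketch below runs difflib.SequenceMatcher directly on one row taken from the sample files. The variable names (`old`, `new`, `changes`) are only for this example and are not part of the original script.

```python
import difflib

# One row from old.csv and the matching row from new.csv.
old = "1004,General Knowledge,Sinson,2007,465"
new = "1004,General Knowledge,Simpson,2007,465"

# SequenceMatcher compares the two strings character by character.
matcher = difflib.SequenceMatcher(None, old, new)

# Collect only the opcodes that describe a difference.
changes = [
    (tag, old[i1:i2], new[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
]

for tag, old_part, new_part in changes:
    print(tag, repr(old_part), "->", repr(new_part))
```

The only reported change is the character run inside the author field, not the field as a whole, which is consistent with the default HTML output.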
The output I currently get from difflib is the default HTML diff output shown further below.
The output I want is word-level highlighting, rather than changes highlighted at the character or sequence level. If anything changes between the old file and the new file, I want the WHOLE WORD highlighted.
Please help me with this: is it actually possible with difflib, or do I have to use other tools or modules? I tried vimdiff and other plugins but got nowhere. I am open to anything here.
The code I am using is from the Python difflib documentation page.
import sys, os, time, difflib, optparse

def main():
    # ..
    # ..
    # ..
    n = options.lines  # I used n = 0
    fromfile, tofile = args  # as specified in the usage string

    # we're passing these as arguments to the diff function
    fromdate = time.ctime(os.stat(fromfile).st_mtime)
    todate = time.ctime(os.stat(tofile).st_mtime)

    # text mode already uses universal newlines; the old 'U' flag is not needed
    with open(fromfile) as f:
        fromlines = f.readlines()
    with open(tofile) as f:
        tolines = f.readlines()

    diff = difflib.HtmlDiff().make_file(fromlines, tolines, fromfile,
                                        tofile, context=True,
                                        numlines=0)

    # make_file returns the whole HTML page as one string
    sys.stdout.write(diff)
Old.csv
refno,title,author,year,price
1001,CPP,MILTON,2008,456
1002,JAVA,Gilson,2002,456
1003,Adobe Flex,2010,566
1004,General Knowledge,Sinson,2007,465
1005,Actionscript,Gilto,2008,480
new.csv
refno,title,author,year,price
1001,CPP,MILTON,2010,456,2008
1002,JAVA,Gilson,2002
1003,Adobe Flexi,Johnson,2010,566
1004,General Knowledge,Simpson,2007,465
105,Action script,Gilto,2008,480
2000,Drama,DayoNe,,2020,560
I am also adding the default HTML diff output and the expected HTML diff output below.
Default HTML DIFF Output from DIFFLIB:
<html>
<head>
<meta http-equiv="Content-Type"
content="text/html; charset=ISO-8859-1" />
<title></title>
<style type="text/css">
table.diff {font-family:Courier; border:medium;}
.diff_header {background-color:#e0e0e0}
td.diff_header {text-align:right}
.diff_next {background-color:#c0c0c0}
.diff_add {background-color:#aaffaa}
.diff_chg {background-color:#ffff77}
.diff_sub {background-color:#ffaaaa}
</style>
</head>
<body>
<table class="diff" id="difflib_chg_to0__top"
cellspacing="0" cellpadding="0" rules="groups" >
<colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup>
<colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup>
<thead><tr><th class="diff_next"><br /></th><th colspan="2" class="diff_header">old.csv</th><th class="diff_next"><br /></th><th colspan="2" class="diff_header">new.csv</th></tr></thead>
<tbody>
<tr><td class="diff_next" id="difflib_chg_to0__0"><a href="#difflib_chg_to0__top">t</a></td><td class="diff_header" id="from0_2">2</td><td nowrap="nowrap">1001,CPP,MILTON,200<span class="diff_sub">8</span>,456</td><td class="diff_next"><a href="#difflib_chg_to0__top">t</a></td><td class="diff_header" id="to0_2">2</td><td nowrap="nowrap">1001,CPP,MILTON,20<span class="diff_add">1</span>0,456<span class="diff_add">,2008</span></td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_3">3</td><td nowrap="nowrap">1002,JAVA,Gilson,2002<span class="diff_sub">,456</span></td><td class="diff_next"></td><td class="diff_header" id="to0_3">3</td><td nowrap="nowrap">1002,JAVA,Gilson,2002</td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_4">4</td><td nowrap="nowrap">1003,Adobe Flex,2010,566</td><td class="diff_next"></td><td class="diff_header" id="to0_4">4</td><td nowrap="nowrap">1003,Adobe Flex<span class="diff_add">i,Johnson</span>,2010,566</td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_5">5</td><td nowrap="nowrap">1004,General Knowledge,Si<span class="diff_chg">n</span>son,2007,465</td><td class="diff_next"></td><td class="diff_header" id="to0_5">5</td><td nowrap="nowrap">1004,General Knowledge,Si<span class="diff_chg">mp</span>son,2007,465</td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_6">6</td><td nowrap="nowrap">1<span class="diff_sub">0</span>05,Actionscript,Gilto,2008,480</td><td class="diff_next"></td><td class="diff_header" id="to0_6">6</td><td nowrap="nowrap">105,Action<span class="diff_add"> </span>script,Gilto,2008,480</td></tr>
<tr><td class="diff_next"></td><td class="diff_header"></td><td nowrap="nowrap"></td><td class="diff_next"></td><td class="diff_header" id="to0_7">7</td><td nowrap="nowrap"><span class="diff_add">2000,Drama,DayoNe,,2020,560</span></td></tr>
</tbody>
</table>
</body>
</html>
Expected HTML DIFF Output from DIFFLIB:
<html>
<head>
<meta http-equiv="Content-Type"
content="text/html; charset=ISO-8859-1" />
<title></title>
<style type="text/css">
table.diff {font-family:Courier; border:medium;}
.diff_header {background-color:#e0e0e0}
td.diff_header {text-align:right}
.diff_next {background-color:#c0c0c0}
.diff_add {background-color:#aaffaa}
.diff_chg {background-color:#ffff77}
.diff_sub {background-color:#ffaaaa}
</style>
</head>
<body>
<table class="diff" id="difflib_chg_to0__top"
cellspacing="0" cellpadding="0" rules="groups" >
<colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup>
<colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup>
<thead><tr><th class="diff_next"><br /></th><th colspan="2" class="diff_header">old.csv</th><th class="diff_next"><br /></th><th colspan="2" class="diff_header">new.csv</th></tr></thead>
<tbody>
<tr><td class="diff_next" id="difflib_chg_to0__0"><a href="#difflib_chg_to0__top">t</a></td><td class="diff_header" id="from0_2">2</td><td nowrap="nowrap">1001,CPP,MILTON,<span class="diff_sub">2008</span>,456</td><td class="diff_next"><a href="#difflib_chg_to0__top">t</a></td><td class="diff_header" id="to0_2">2</td><td nowrap="nowrap">1001,CPP,MILTON,<span class="diff_add">2010</span>,456<span class="diff_add">,2008</span></td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_3">3</td><td nowrap="nowrap">1002,JAVA,Gilson,2002<span class="diff_sub">,456</span></td><td class="diff_next"></td><td class="diff_header" id="to0_3">3</td><td nowrap="nowrap">1002,JAVA,Gilson,2002</td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_4">4</td><td nowrap="nowrap">1003,<span class="diff_sub">Adobe Flex</span>,2010,566</td><td class="diff_next"></td><td class="diff_header" id="to0_4">4</td><td nowrap="nowrap">1003,<span class="diff_add">Adobe Flexi</span>,<span class="diff_add">Johnson</span>,2010,566</td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_5">5</td><td nowrap="nowrap">1004,General Knowledge,<span class="diff_sub">Sinson</span>,2007,465</td><td class="diff_next"></td><td class="diff_header" id="to0_5">5</td><td nowrap="nowrap">1004,General Knowledge,<span class="diff_add">Simpson</span>,2007,465</td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_6">6</td><td nowrap="nowrap"><span class="diff_sub">1005</span>,<span class="diff_sub">Actionscript</span>,Gilto,2008,480</td><td class="diff_next"></td><td class="diff_header" id="to0_6">6</td><td nowrap="nowrap"><span class="diff_add">105</span>,<span class="diff_add">Action script</span>,Gilto,2008,480</td></tr>
<tr><td class="diff_next"></td><td class="diff_header"></td><td nowrap="nowrap"></td><td class="diff_next"></td><td class="diff_header" id="to0_7">7</td><td nowrap="nowrap"><span class="diff_add">2000,Drama,DayoNe,,2020,560</span></td></tr>
</tbody>
</table>
</body>
</html>
Question: I am looking for a word level highlight
Implement a class Comma_HtmlDiff that extends the highlighting to the comma boundaries.
You have to override difflib.ndiff.
Note: Only expanding the first highlighted part is implemented.
If difflib.ndiff highlights across a comma, this is not corrected.
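To make the idea of widening the highlight more concrete, here is a small sketch of what difflib.ndiff emits for one changed row from the sample files. The lines starting with `?` are the guide lines that mark the changed characters, and those markers are what get widened to the surrounding commas. The names `old`, `new` and `lines` are only for this illustration.

```python
import difflib

# ndiff expects sequences of lines, so each row is wrapped in a list.
old = ["1004,General Knowledge,Sinson,2007,465\n"]
new = ["1004,General Knowledge,Simpson,2007,465\n"]

lines = list(difflib.ndiff(old, new))

# Each line starts with a two-character prefix:
#   "- " line only in old, "+ " line only in new,
#   "? " guide line that marks changed characters in the line before it.
for line in lines:
    print(line, end="")
```

The `^` markers on the `?` lines sit under the changed characters of the author field only, so a whole-field highlight needs those markers stretched to the nearest comma on each side.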
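class Comma_HtmlDiff(difflib.HtmlDiff):
    def __init__(self, tabsize=8, wrapcolumn=None, linejunk=None,
                 charjunk=difflib.IS_CHARACTER_JUNK):
        # HtmlDiff looks up ndiff in the difflib module, so it is replaced there.
        # Save the original only once, so a second instance does not wrap the wrapper.
        if not hasattr(difflib, '_ndiff'):
            setattr(difflib, '_ndiff', difflib.ndiff)
        setattr(difflib, 'ndiff', self.ndiff)
        super().__init__(tabsize, wrapcolumn, linejunk, charjunk)

    def ndiff(self, a, b, linejunk=None, charjunk=difflib.IS_CHARACTER_JUNK):
        _line = ''
        for line in difflib._ndiff(a, b, linejunk, charjunk):
            # Remember the last '-' or '+' line; the '?' guide line refers to it
            if line.startswith('-'):
                _d = '-'
                _line = line
            elif line.startswith('+'):
                _d = '+'
                _line = line

            if line.startswith('?'):
                # Position of the first marker in the guide line
                dp = line.find(_d)
                if dp == -1:
                    _d = '+'
                    dp = line.find('^')

                # Left boundary: just after the previous comma, or after the 2-char prefix
                dpl = _line.rfind(',', 0, dp)
                if dpl == -1:
                    dpl = 2
                else:
                    dpl += 1

                # Right boundary: the next comma
                dpr = _line.find(',', dp)
                if dpr == dp:
                    # The marker sits on a comma itself: blank it out
                    _d = ' '
                    dpl = dp
                    dpr = dp+1

                # Fill the whole field with the marker
                dpw = dpr - dpl
                line = line[:dpl] + _d*dpw + line[dpr:]

            yield line

# Usage
diff = Comma_HtmlDiff().make_file(fromlines, tolines, fromfile,
                                  tofile, context=True,
                                  numlines=0)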
Tested with Python: 3.4.2
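As a different route to the same goal, and not the method of the class above, one could skip the patching of difflib.ndiff and compare the rows field by field. The sketch below is only an illustration under a few assumptions: the helper name `mark_fields` is hypothetical, it handles a single pair of rows, it reuses the `diff_sub` and `diff_add` CSS classes from the HtmlDiff output, and it rejoins fields with plain commas, so quoted fields that contain commas would lose their quoting.

```python
import csv
import difflib
import html


def mark_fields(old_line, new_line):
    """Return (old_html, new_html) with every changed CSV field wrapped in a span."""
    # Parse each row into a list of fields with the csv module.
    old_fields = next(csv.reader([old_line]))
    new_fields = next(csv.reader([new_line]))

    # Compare lists of fields instead of strings of characters,
    # so the smallest unit that can differ is a whole field.
    matcher = difflib.SequenceMatcher(None, old_fields, new_fields, autojunk=False)

    old_out, new_out = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old_part = [html.escape(field) for field in old_fields[i1:i2]]
        new_part = [html.escape(field) for field in new_fields[j1:j2]]
        if tag != "equal":
            old_part = ['<span class="diff_sub">%s</span>' % f for f in old_part]
            new_part = ['<span class="diff_add">%s</span>' % f for f in new_part]
        old_out.extend(old_part)
        new_out.extend(new_part)

    return ",".join(old_out), ",".join(new_out)


old_html, new_html = mark_fields(
    "1004,General Knowledge,Sinson,2007,465",
    "1004,General Knowledge,Simpson,2007,465",
)
print(old_html)
print(new_html)
```

The returned strings would still have to be placed into the table cells by hand, and rows where columns are inserted or removed may be aligned differently from what a reader expects, so this is a starting point rather than a drop-in replacement for HtmlDiff.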