How can I compare two CSV files and output the rows from both files where there are changes?
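I'm new to Python, and I'm trying to compare two CSV files that contain mostly the same information, except that some rows have been removed entirely, some have been added, and some have had a single value changed. I need an output file that contains the full row from both the previous and the current CSV file, but only where something changed. I also need to add a column at the front that labels each row according to which file it came from (previous or current).
I tried HtmlDiff from difflib, but it doesn't give me the information in the format I want, and it also shows all the unchanged rows. I also tried csv.reader with diff_rows, but that was a disaster.
The closest I've come is the code below, but in the combined file it produces I can't tell which row came from which file, because the rows aren't labeled. Try not to laugh at my code; I'm sure there's a better way to do this, but I couldn't figure it out on my own, and I'd really appreciate your help.
If I don't define previous and current a second time, the removals output is empty.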
    previous = open('2019-08-21.csv', 'r', encoding="utf8")
    current = open('2019-08-27.csv', 'r', encoding="utf8")

    # Lines that appear only in the current file.
    additions = set(current) - set(previous)
    with open('Additions Aug 2019.csv', 'w', encoding="utf8") as file_out:
        for line in additions:
            file_out.write(line)

    # set() has consumed both file objects above, so they are re-opened here;
    # without this, both sets below would be empty.
    previous = open('2019-08-21.csv', 'r', encoding="utf8")
    current = open('2019-08-27.csv', 'r', encoding="utf8")

    # Lines that appear only in the previous file.
    removals = set(previous) - set(current)
    with open('Removals Aug 2019.csv', 'w', encoding="utf8") as file_out:
        for line in removals:
            file_out.write(line)

    # Concatenate the two result files; nothing marks which file a line came from.
    filenames = ['Additions Aug 2019.csv', 'Removals Aug 2019.csv']
    with open('Add, Rem Aug 2019.csv', 'w', encoding="utf8") as outfile:
        for fname in filenames:
            with open(fname, encoding="utf8") as infile:
                for line in infile:
                    outfile.write(line)

    previous.close()
    current.close()
    file_out.close()  # redundant: the with block has already closed it
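The re-opening in the code above is needed because a file object is a one-shot iterator: once set(current) has consumed it, a second set(current) sees nothing. A minimal sketch of one way around that, using only the standard library, is to read each file once into a list and attach the label while writing. The function name labeled_changes, the label strings, and the sample data are assumptions made for illustration; the real files would be opened with open(...) instead of io.StringIO.

```python
import csv
import io


def labeled_changes(previous_file, current_file):
    """Return rows that appear in only one file, each prefixed with a label."""
    # Materialize each file once; a file object is a one-shot iterator,
    # so it cannot be passed to set() a second time.
    previous_rows = [tuple(row) for row in csv.reader(previous_file)]
    current_rows = [tuple(row) for row in csv.reader(current_file)]

    previous_set = set(previous_rows)
    current_set = set(current_rows)

    changes = []
    # Iterate over the lists (not the sets) so the output keeps file order.
    for row in previous_rows:
        if row not in current_set:
            changes.append(("Removed",) + row)
    for row in current_rows:
        if row not in previous_set:
            changes.append(("Added",) + row)
    return changes


# Hypothetical sample data standing in for the two dated CSV files.
previous_csv = io.StringIO(
    "Full Name,Country,Position\n"
    "Ann Lee,France,Minister\n"
    "Bob Ray,Spain,Mayor\n"
    "Dee Fox,Chile,Senator\n"
)
current_csv = io.StringIO(
    "Full Name,Country,Position\n"
    "Ann Lee,France,President\n"
    "Cy Dole,Italy,Judge\n"
    "Dee Fox,Chile,Senator\n"
)

changes = labeled_changes(previous_csv, current_csv)

# Write the labeled rows; the shared header row is identical in both
# files, so it is filtered out above and written explicitly here.
output = io.StringIO()
writer = csv.writer(output, lineterminator="\n")
writer.writerow(["Change Type", "Full Name", "Country", "Position"])
writer.writerows(changes)
print(output.getvalue())
```

A row with one changed value shows up twice, once as "Removed" (the old version) and once as "Added" (the new version), which matches the requirement of getting the full row from both files.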
I managed to find a solution with pandas, and I'm sharing it here for anyone else who might need it.
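    import pandas as pd

    # Load the previous file, drop the columns that should not take part in
    # the comparison, and label every row as "Removed".
    df_p = pd.read_csv('2019-08-21.csv', encoding='utf8')
    df_p.drop(['Middle Name', 'Date of Birth', 'Place of Birth'], axis=1, inplace=True)
    df_p.insert(0, 'Change Type', "Removed")

    # Do the same for the current file, labeling every row as "Added".
    df_c = pd.read_csv('2019-08-27.csv', encoding='utf8')
    df_c.drop(['Middle Name', 'Date of Birth', 'Place of Birth'], axis=1, inplace=True)
    df_c.insert(0, 'Change Type', "Added")

    # Stack both frames. pd.concat replaces DataFrame.append, which was
    # removed in pandas 2.0.
    df_f = pd.concat([df_c, df_p])

    # keep=False drops every row whose key columns occur more than once, so
    # rows present in both files disappear and only the changes remain.
    df_dedup = df_f.drop_duplicates(subset=['Full Name', 'Country', 'Position'], keep=False)

    df_dedup.to_csv('Aug 2019 Changes.csv', index=False, encoding='utf8')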
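One caveat with drop_duplicates(keep=False) on the stacked frames is that a row repeated inside a single file is also treated as a duplicate and dropped, even if the other file doesn't contain it. A possible alternative, sketched here on hypothetical sample data, is an outer merge with indicator=True, which records for each row whether it came from the left frame, the right frame, or both. The sample rows and the key_columns name are assumptions for illustration, not taken from the real files.

```python
import io

import pandas as pd

# Hypothetical sample data; the real files and column names may differ.
previous_csv = io.StringIO(
    "Full Name,Country,Position\n"
    "Ann Lee,France,Minister\n"
    "Bob Ray,Spain,Mayor\n"
    "Dee Fox,Chile,Senator\n"
)
current_csv = io.StringIO(
    "Full Name,Country,Position\n"
    "Ann Lee,France,President\n"
    "Cy Dole,Italy,Judge\n"
    "Dee Fox,Chile,Senator\n"
)

df_p = pd.read_csv(previous_csv)
df_c = pd.read_csv(current_csv)

key_columns = ["Full Name", "Country", "Position"]

# An outer merge keeps every row from both frames; indicator=True adds a
# "_merge" column holding "left_only", "right_only" or "both".
merged = df_p.merge(df_c, how="outer", on=key_columns, indicator=True)

# Keep rows found in only one file and translate the indicator into a label.
changed = merged[merged["_merge"] != "both"].copy()
labels = {"left_only": "Removed", "right_only": "Added"}
changed.insert(0, "Change Type", changed["_merge"].astype(str).map(labels))
changed = changed.drop(columns="_merge")

print(changed.to_csv(index=False))
```

The row order of an outer merge is not guaranteed to follow either input file, so sort the result explicitly if the order matters.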