How to compare CSV files and output lines from both files where there are changes?

Question

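I'm new to Python and am trying to compare 2 csv files that contain mostly the same information, except that some rows have been removed entirely, added, or had only 1 value changed. I need an output file that contains the full row from both the previous and the current csv file, but only where there is a change. I also need to add a column at the front that labels each row according to which file it came from (previous or current).

I tried using HtmlDiff from difflib, but it doesn't give me the information in the format I want, and it also shows all the unchanged information. I also tried csv.reader and diff_rows, but that was a disaster.

The closest I have come is the code below, but in the combined file it outputs I can't tell which row came from which file, because there are no labels. Try not to laugh at my code; I'm sure there is a better way to do this, but I couldn't figure it out on my own, and I'd really appreciate your help.

If I don't define previous and current a second time, the removals output is empty.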

previous = open('2019-08-21.csv', 'r', encoding="utf8")
current = open('2019-08-27.csv', 'r', encoding="utf8")

additions = set(current) - set(previous)

with open('Additions Aug 2019.csv', 'w', encoding="utf8") as file_out:
    for line in additions:
        file_out.write(line)

previous = open('2019-08-21.csv', 'r', encoding="utf8")
current = open('2019-08-27.csv', 'r', encoding="utf8")

removals = set(previous) - set(current)

with open('Removals Aug 2019.csv', 'w', encoding="utf8") as file_out:
    for line in removals:
        file_out.write(line)

filenames = ['Additions Aug 2019.csv', 'Removals Aug 2019.csv']
with open('Add, Rem Aug 2019.csv', 'w', encoding="utf8") as outfile:
    for fname in filenames:
        with open(fname) as infile:
            for line in infile:
                outfile.write(line)

previous.close()
current.close()
file_out.close()
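
The empty removals output is consistent with how Python file objects behave: a file handle is a one-shot iterator, so once set(current) and set(previous) have consumed both files, a second pass over the same handles yields nothing. A minimal sketch of reading each file once and reusing the lines follows. Here io.StringIO and the sample rows are invented stand-ins for the real files, and the helper name diff_lines is only illustrative.

```python
import io


def diff_lines(previous_lines, current_lines):
    """Return (additions, removals) as sets of whole lines."""
    previous_set = set(previous_lines)
    current_set = set(current_lines)
    return current_set - previous_set, previous_set - current_set


# io.StringIO stands in for the two CSV files so the sketch is self-contained.
previous = io.StringIO("name,country\nAnn,US\nBob,UK\n")
current = io.StringIO("name,country\nAnn,US\nCid,FR\n")

# Read each file exactly once; the resulting lists can be reused freely.
previous_lines = previous.readlines()
current_lines = current.readlines()

additions, removals = diff_lines(previous_lines, current_lines)
print(sorted(additions))  # ['Cid,FR\n']
print(sorted(removals))   # ['Bob,UK\n']

# Both handles are now at end-of-file, so iterating again yields nothing.
print(set(previous))  # set()
```

Because the lines are held in lists, both differences can be computed without reopening either file.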

Answer 1

I managed to find a solution with pandas and am sharing it for anyone else who might need it.

import pandas as pd

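# Load the older snapshot, drop the columns that should not take part in the
# comparison, and label every row as coming from the previous file.
df_p = pd.read_csv('2019-08-21.csv', encoding='utf8')
df_p.drop(['Middle Name', 'Date of Birth', 'Place of Birth'], axis=1, inplace=True)
df_p.insert(0, 'Change Type', "Removed")

# Do the same for the newer snapshot, labelling its rows as additions.
df_c = pd.read_csv('2019-08-27.csv', encoding='utf8')
df_c.drop(['Middle Name', 'Date of Birth', 'Place of Birth'], axis=1, inplace=True)
df_c.insert(0, 'Change Type', "Added")

# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement.
df_f = pd.concat([df_c, df_p])

# keep=False drops every row whose key columns occur more than once in the
# combined frame, leaving only rows found in one file (this assumes the key
# columns are unique within each file).
df_dedup = df_f.drop_duplicates(subset=['Full Name', 'Country', 'Position'], keep=False)

df_dedup.to_csv('Aug 2019 Changes.csv', index=False, encoding='utf8')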
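
For comparison, the same idea can be sketched with only the standard library, comparing entire rows instead of a chosen subset of columns. This is an illustration under assumptions, not the method used above: the function name labelled_changes, the "Source" column, and the "Previous"/"Current" labels are hypothetical, the sample data is invented, and both files are assumed to share the same header row.

```python
import csv
import io


def labelled_changes(previous_file, current_file):
    """Return (header, rows) for rows that appear in only one of the files.

    Each returned row is prefixed with a label naming the file it came from.
    """
    prev_reader = csv.reader(previous_file)
    curr_reader = csv.reader(current_file)
    header = next(prev_reader)
    next(curr_reader)  # assumption: both files have the same header row

    # Tuples are hashable, so whole rows can be stored in sets.
    prev_rows = [tuple(row) for row in prev_reader]
    curr_rows = [tuple(row) for row in curr_reader]
    prev_set, curr_set = set(prev_rows), set(curr_rows)

    # Iterate over the lists (not the sets) to preserve the original order.
    changes = [("Previous",) + row for row in prev_rows if row not in curr_set]
    changes += [("Current",) + row for row in curr_rows if row not in prev_set]
    return ["Source"] + header, changes


# Invented sample data standing in for the two CSV files.
previous = io.StringIO(
    "Full Name,Country,Position\n"
    "Ann Lee,US,Analyst\n"
    "Bob Ray,UK,Manager\n"
    "Dee Fox,DE,Clerk\n"
)
current = io.StringIO(
    "Full Name,Country,Position\n"
    "Ann Lee,US,Analyst\n"
    "Bob Ray,UK,Director\n"
    "Cid Poe,FR,Clerk\n"
)

header, changes = labelled_changes(previous, current)

# Write the labelled rows as CSV text; a real script would write to a file.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(header)
writer.writerows(changes)
print(out.getvalue())
```

A row whose single value changed shows up twice, once labelled "Previous" and once labelled "Current", while removed and added rows appear once each. With real files, the csv module documentation recommends opening them with newline='' (for example open('2019-08-21.csv', newline='', encoding='utf8')).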

python

csv

comparison