Compare CSV files and print the rows that differ in both the source and the target with Python

Old CSV file

Column1 (column name: I am not comparing the column names, and the rows in the CSV are not in the same order)

AA101
BB101
CC101
DD101
EE101

New CSV file

Column2 (column name: I am not comparing the column names, and the rows in the CSV are not in the same order)

AA101
CC101
BB101
DD102
EE102

Expected result file:

Different:
Old
DD101 (it is not in the New file)
EE101 (it is not in the New file)
New
DD102 (it is not in the Old file)
EE102 (it is not in the Old file)

I referred to this post and wrote the following code:

import csv

# Raw strings keep the backslashes in Windows paths from being read as escapes.
Source_filename = r"E:\Path\Source1.csv"
Target_filename = r"E:\Path\Target1.csv"
output_filename = "E:results.csv"

# Load all the entries from Source into a set for quick lookup.
source_ids = set()

with open(Source_filename, 'r') as f:
    big_ip = csv.reader(f)
    for csv_row in big_ip:
        source_ids.add(csv_row[0])

# print source_ids

# Check every entry in Target against the set built from Source.
with open(Target_filename, 'r') as input_file, open(output_filename, 'w') as output_file:
    input_csv = csv.reader(input_file)
    output_csv = csv.writer(output_file)
    for csv_row in input_csv:
        ip = csv_row[0]
        status = "Present" if ip in source_ids else "Not Present"
        output_csv.writerow([ip, status + " in Source.csv"])

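The output contains both the rows that match and the rows that differ. I only need the rows that differ between the source and the target.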

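One option is to use Pandas. There are many ways to do this; here is one that gives you a complete list of all records, with an indicator column (named _merge by default) set to "both" (if the record appears in both files), "left_only" (if it appears only in the old file), or "right_only" (if it appears only in the new file). See the Pandas merge documentation for more information.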

import pandas as pd

old = pd.read_csv('old_file.csv')
new = pd.read_csv('new_file.csv')
output = old.merge(
    new,
    left_on='old_column_name',
    right_on='new_column_name',
    how='outer',
    indicator=True,
)
output.to_csv('output.csv')

You can also filter on the indicator before saving to CSV:

output[output['_merge'] != 'both'].to_csv('output.csv')
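
To make the filtering step concrete, here is a minimal sketch run against the sample data from the question. It assumes the files have header rows named Column1 and Column2 as shown there; the in-memory CSV text is a stand-in for old_file.csv and new_file.csv, and the names diff, only_old and only_new are illustrative only.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for old_file.csv and new_file.csv.
old_csv = io.StringIO("Column1\nAA101\nBB101\nCC101\nDD101\nEE101\n")
new_csv = io.StringIO("Column2\nAA101\nCC101\nBB101\nDD102\nEE102\n")

old = pd.read_csv(old_csv)
new = pd.read_csv(new_csv)

# Outer merge keeps every record; indicator=True adds the _merge column.
output = old.merge(
    new,
    left_on="Column1",
    right_on="Column2",
    how="outer",
    indicator=True,
)

# Keep only the records that are missing from one of the two files.
diff = output[output["_merge"] != "both"]

# With different key names, each side keeps its own column, so read the
# old-only values from Column1 and the new-only values from Column2.
only_old = sorted(diff.loc[diff["_merge"] == "left_only", "Column1"])
only_new = sorted(diff.loc[diff["_merge"] == "right_only", "Column2"])

print(only_old)
print(only_new)
```

The values are sorted before printing because the row order of an outer merge can differ between pandas versions.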

With Pandas and pd.merge:

>>> %cat Source1.csv
AA101
BB101
CC101
DD101
EE101

>>> %cat Target1.csv
AA101
CC101
BB101
DD102
EE102
# Python env: pip install pandas
# Anaconda env: conda install pandas
import pandas as pd

source = pd.read_csv('Source1.csv', names=['big_ip'], header=None)
target = pd.read_csv('Target1.csv', names=['big_ip'], header=None)

df = pd.merge(source, target, how='outer', indicator=True)
>>> df
  big_ip      _merge
0  AA101        both  # <- present both in source and target
1  BB101        both
2  CC101        both
3  DD101   left_only  # <- present in source only (old)
4  EE101   left_only
5  DD102  right_only  # <- present in target only (new)
6  EE102  right_only

The output can be customized to suit your needs.
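
As one possible customization, the sketch below turns the merged frame into a report laid out like the expected result in the question. The helper name build_report is hypothetical, and the two frames are built inline from the sample values instead of being read from Source1.csv and Target1.csv so that the example runs on its own.

```python
import pandas as pd

# Sample values from the question, built inline instead of read from disk.
source = pd.DataFrame({"big_ip": ["AA101", "BB101", "CC101", "DD101", "EE101"]})
target = pd.DataFrame({"big_ip": ["AA101", "CC101", "BB101", "DD102", "EE102"]})

df = pd.merge(source, target, how="outer", indicator=True)


def build_report(merged):
    """Return report lines listing the values found in only one file."""
    old_only = sorted(merged.loc[merged["_merge"] == "left_only", "big_ip"])
    new_only = sorted(merged.loc[merged["_merge"] == "right_only", "big_ip"])

    lines = ["Different:", "Old"]
    lines += [f"{value} (it is not in the New file)" for value in old_only]
    lines.append("New")
    lines += [f"{value} (it is not in the Old file)" for value in new_only]
    return lines


report = build_report(df)
print("\n".join(report))
```

Writing the lines to a text file instead of printing them would only require opening the file and writing "\n".join(report).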