Compare two CSV files with Python and print the rows that differ between the source and the target
Old CSV file
Column1 (column name; I am not comparing column names, and the rows are not in the same order in both files)
AA101
BB101
CC101
DD101
EE101
New CSV file
Column2 (column name; I am not comparing column names, and the rows are not in the same order in both files)
AA101
CC101
BB101
DD102
EE102
Expected result file:
Different:
Old
DD101 (it is not in the New file)
EE101 (it is not in the New file)
New
DD102 (it is not in the Old file)
EE102 (it is not in the Old file)
I referred to this post and wrote the following code:
import csv

# Raw strings keep the backslashes in Windows paths literal.
Source_filename = r"E:\Path\Source1.csv"
Target_filename = r"E:\Path\Target1.csv"
output_filename = "E:results.csv"

# Load all the entries from Source into a set for quick lookup.
source_ids = set()
with open(Source_filename, 'r', newline='') as f:
    big_ip = csv.reader(f)
    for csv_row in big_ip:
        source_ids.add(csv_row[0])
# print(source_ids)

# Read Target and record whether each entry also appears in Source.
with open(Target_filename, 'r', newline='') as input_file, open(output_filename, 'w', newline='') as output_file:
    input_csv = csv.reader(input_file)
    output_csv = csv.writer(output_file)
    for csv_row in input_csv:
        ip = csv_row[0]
        status = "Present" if ip in source_ids else "Not Present"
        output_csv.writerow([ip, status + " in Source.csv"])
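The output contains both the rows that match and the rows that differ. I only need the rows that differ between the source and the target.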
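For comparison, the difference the question asks for can be sketched with the standard library alone, using one set per file. This is a minimal sketch, not the asker's code: the helper names `first_column` and `diff_rows` are made up for illustration, and the two `io.StringIO` objects stand in for the real files so the example runs anywhere.

```python
import csv
import io


def first_column(csv_file, skip_header=True):
    """Return the first-column values of an open CSV file, in file order."""
    reader = csv.reader(csv_file)
    if skip_header:
        next(reader, None)  # column names are not compared
    return [row[0] for row in reader if row]


def diff_rows(old_values, new_values):
    """Return (only_in_old, only_in_new), keeping each file's own order."""
    old_set, new_set = set(old_values), set(new_values)
    only_old = [value for value in old_values if value not in new_set]
    only_new = [value for value in new_values if value not in old_set]
    return only_old, only_new


# Hypothetical in-memory stand-ins for the old and new files in the question.
old_file = io.StringIO("Column1\nAA101\nBB101\nCC101\nDD101\nEE101\n")
new_file = io.StringIO("Column2\nAA101\nCC101\nBB101\nDD102\nEE102\n")

only_old, only_new = diff_rows(first_column(old_file), first_column(new_file))
print("Old:", only_old)
print("New:", only_new)
```

With real files, the same helpers would take the objects returned by `open(path, newline='')` instead of the `StringIO` stand-ins.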
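One option is to use Pandas. There are many ways to do this; here is one that gives you a complete list of all records, with an indicator column set to "both" (if the record appears in both files), "left_only" (if it appears only in the old file), or "right_only" (if it appears only in the new file). The Pandas merge documentation has more information.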
import pandas as pd
old = pd.read_csv('old_file.csv')
new = pd.read_csv('new_file.csv')
output = old.merge(
new,
left_on='old_column_name',
right_on='new_column_name',
how='outer',
indicator=True,
)
output.to_csv('output.csv')
You can also filter on the indicator before saving to CSV:
output[output['_merge'] != 'both'].to_csv('output.csv')
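As a rough, self-contained illustration of the merge and filter above, the sketch below runs them on the sample data from the question. The in-memory CSV strings are stand-ins for `old_file.csv` and `new_file.csv`, and `Column1` / `Column2` are assumed to be the real column names. Because the key columns have different names, a row found in only one file has a value in that file's column and NaN in the other.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for old_file.csv and new_file.csv.
old = pd.read_csv(io.StringIO("Column1\nAA101\nBB101\nCC101\nDD101\nEE101\n"))
new = pd.read_csv(io.StringIO("Column2\nAA101\nCC101\nBB101\nDD102\nEE102\n"))

merged = old.merge(
    new,
    left_on="Column1",
    right_on="Column2",
    how="outer",
    indicator=True,
)

# Keep only the rows that appear in exactly one of the two files.
diff = merged[merged["_merge"] != "both"]

# Rows only in the old file carry their value in Column1,
# rows only in the new file carry it in Column2.
only_old = sorted(diff.loc[diff["_merge"] == "left_only", "Column1"])
only_new = sorted(diff.loc[diff["_merge"] == "right_only", "Column2"])
print(only_old)
print(only_new)
```

The lists are sorted explicitly because the row order of an outer merge has changed between Pandas versions.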
With Pandas and pd.merge:
>>> %cat Source1.csv
AA101
BB101
CC101
DD101
EE101
>>> %cat Target1.csv
AA101
CC101
BB101
DD102
EE102
# Python env: pip install pandas
# Anaconda env: conda install pandas
import pandas as pd
source = pd.read_csv('Source1.csv', names=['big_ip'], header=None)
target = pd.read_csv('Target1.csv', names=['big_ip'], header=None)
df = pd.merge(source, target, how='outer', indicator=True)
>>> df
  big_ip      _merge
0  AA101        both  # <- present both in source and target
1  BB101        both
2  CC101        both
3  DD101   left_only  # <- present in source only (old)
4  EE101   left_only
5  DD102  right_only  # <- present in target only (new)
6  EE102  right_only
You can customize the output to suit your needs.
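One possible customization, sketched here under assumptions, turns the merged frame into the report layout from the question. The function name `build_report` is invented for this example, and the two frames are built inline instead of being read from `Source1.csv` and `Target1.csv` so the snippet runs on its own.

```python
import pandas as pd

# Inline stand-ins for the contents of Source1.csv and Target1.csv.
source = pd.DataFrame({"big_ip": ["AA101", "BB101", "CC101", "DD101", "EE101"]})
target = pd.DataFrame({"big_ip": ["AA101", "CC101", "BB101", "DD102", "EE102"]})
df = pd.merge(source, target, how="outer", indicator=True)


def build_report(merged, key="big_ip"):
    """Return report lines for the rows found in only one of the two frames."""
    only_old = merged.loc[merged["_merge"] == "left_only", key]
    only_new = merged.loc[merged["_merge"] == "right_only", key]
    lines = ["Different:", "Old"]
    lines += [f"{value} (it is not in the New file)" for value in only_old]
    lines.append("New")
    lines += [f"{value} (it is not in the Old file)" for value in only_new]
    return lines


report = build_report(df)
print("\n".join(report))
```

Writing the report to disk would then be a matter of passing the joined string to a file opened for writing.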