How can I compare results from two CSV files in Python with DictReader or Pandas? (Open to any other methods as well!)

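I am trying to compare the results from two CSV files. I want to check whether the first column (ID_NUMBER) in CSV1 matches CSV2, then check whether the values in the remaining columns also match, and output "True" or "False" for each corresponding column. Finally, I want to add a column that is True if all the values in the row are "True" and False if any of them is "False". Here is my code: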

import csv
import os
import sys
import difflib

def compare():
    # Load each CSV into a list of dicts keyed by the header row.
    # Text mode with newline='' is what the csv module expects on Python 3.
    list_csv1 = []
    with open('query1_results.csv', newline='') as query_results1:
        reader1 = csv.DictReader(query_results1)
        for row1 in reader1:
            list_csv1.append(row1)

    list_csv2 = []
    with open('query2_results.csv', newline='') as query_results2:
        reader2 = csv.DictReader(query_results2)
        for row2 in reader2:
            list_csv2.append(row2)

CSV1:

{'ID_NUMBER': '1J2Wh', 'CHECKING_CODE': '20', 'SECURITY_CODE': '0', 'JOB_ID': ''}
{'ID_NUMBER': '124ggfh', 'CHECKING_CODE': '100', 'SECURITY_CODE': '0', 'JOB_ID': 'O'}

CSV2:

{'ID_NUMBER': '1J2Wh', 'CHECKING_CODE': '600', 'SECURITY_CODE': '0', 'JOB_ID': ''}
{'ID_NUMBER': '124ggfh', 'CHECKING_CODE': '100', 'SECURITY_CODE': '0', 'JOB_ID': 'O'}

The output I need:

{'ID_NUMBER': '1J2Wh', 'CHECKING_CODE': FALSE, 'SECURITY_CODE': TRUE, 'JOB_ID': TRUE}
{'ID_NUMBER': '124ggfh', 'CHECKING_CODE': TRUE, 'SECURITY_CODE': TRUE, 'JOB_ID': TRUE}

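Assuming your CSV data has the columns shown in the question (ID_NUMBER, CHECKING_CODE, SECURITY_CODE, JOB_ID), you can use the code below: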

import pandas as pd

csv1 = pd.read_csv('data1.csv')
csv2 = pd.read_csv('data2.csv')
csv3 = pd.merge(csv1, csv2, on='ID_NUMBER', how='inner')

# csv3 will have columns ['ID_NUMBER', 'CHECKING_CODE_x', 'SECURITY_CODE_x', 'JOB_ID_x', 'CHECKING_CODE_y', 'SECURITY_CODE_y', 'JOB_ID_y']

# compare each pair of suffixed columns element-wise
csv3['CHECKING_CODE'] = csv3.CHECKING_CODE_x == csv3.CHECKING_CODE_y
csv3['SECURITY_CODE'] = csv3.SECURITY_CODE_x == csv3.SECURITY_CODE_y
csv3['JOB_ID'] = csv3.JOB_ID_x == csv3.JOB_ID_y
csv_out = csv3[['ID_NUMBER', 'CHECKING_CODE', 'SECURITY_CODE', 'JOB_ID']]

This gives you one row per matching ID_NUMBER, with True or False in each compared column.
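
One caveat with the merge approach: by default pd.read_csv parses empty fields as NaN, and NaN never compares equal to NaN, so two empty JOB_ID cells would come out as False rather than the True the question expects. Below is a minimal sketch that keeps every field as a string and loops over the columns instead of naming each one. It assumes both files share the same header; the inline sample data and the helper name compare_on_id are hypothetical, introduced only for illustration.

```python
import io

import pandas as pd

# Hypothetical inline data standing in for the two CSV files in the question
CSV1 = """ID_NUMBER,CHECKING_CODE,SECURITY_CODE,JOB_ID
1J2Wh,20,0,
124ggfh,100,0,O
"""
CSV2 = """ID_NUMBER,CHECKING_CODE,SECURITY_CODE,JOB_ID
1J2Wh,600,0,
124ggfh,100,0,O
"""


def compare_on_id(left, right, key="ID_NUMBER"):
    """Inner-join on `key`, then compare every other column pairwise.

    Assumes `right` has the same non-key columns as `left`.
    """
    merged = pd.merge(left, right, on=key, how="inner", suffixes=("_x", "_y"))
    out = merged[[key]].copy()
    for col in left.columns:
        if col == key:
            continue
        out[col] = merged[col + "_x"] == merged[col + "_y"]
    return out


# dtype=str plus keep_default_na=False keeps empty fields as '' instead of NaN
read_opts = dict(dtype=str, keep_default_na=False)
csv1 = pd.read_csv(io.StringIO(CSV1), **read_opts)
csv2 = pd.read_csv(io.StringIO(CSV2), **read_opts)

result = compare_on_id(csv1, csv2)
print(result)
```

With real files, the io.StringIO(...) arguments would be replaced by the file paths. Reading everything as strings also means '20' and '20.0' would count as different values, which may or may not be what you want.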

I would do it this way:

# compare all columns except first one and join result DF with the first column
cols = df1.columns
cmp = df1[['ID_NUMBER']].join(df1[cols[1:]] == df2[cols[1:]])

Step by step:

In [53]: %paste
# generating sample data frames
import numpy as np
import pandas as pd

cols = ['ID_NUMBER', 'CHECKING_CODE', 'SECURITY_CODE', 'JOB_ID']
df1 = pd.DataFrame(np.random.randint(0,100,size=(5, 4)), columns=cols)
df2 = df1.copy()

# changing some cells (.loc replaces the .ix indexer, which was removed from pandas)
df2.loc[1, 'JOB_ID'] = 100
df2.loc[2, 'SECURITY_CODE'] = 100
df2.loc[3, 'CHECKING_CODE'] = 100
## -- End pasted text --

In [54]: df1
Out[54]:
   ID_NUMBER  CHECKING_CODE  SECURITY_CODE  JOB_ID
0         58             47             62      72
1         75             21             67      99
2         49             70             92      30
3         80             85             95      78
4         64             82             21       2

In [55]: df2
Out[55]:
   ID_NUMBER  CHECKING_CODE  SECURITY_CODE  JOB_ID
0         58             47             62      72
1         75             21             67     100
2         49             70            100      30
3         80            100             95      78
4         64             82             21       2

In [56]: %paste
# comparison
cols = df1.columns
cmp = df1[['ID_NUMBER']].join(df1[cols[1:]] == df2[cols[1:]])
## -- End pasted text --

In [57]: cmp
Out[57]:
   ID_NUMBER CHECKING_CODE SECURITY_CODE JOB_ID
0         58          True          True   True
1         75          True          True  False
2         49          True         False   True
3         80         False          True   True
4         64          True          True   True
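
The == comparison above lines rows up by their index labels, so it assumes both frames list the same IDs in the same order; pandas raises an error if the row or column labels differ. If the row order might differ between the two files, one option is to align on ID_NUMBER before comparing. This is only a sketch: the sample frames are made up for illustration, the name cmp_by_id is hypothetical, and IDs are assumed to be unique within each frame.

```python
import pandas as pd

# Made-up frames with the same IDs in a different row order
df1 = pd.DataFrame({
    "ID_NUMBER": ["a", "b", "c"],
    "CHECKING_CODE": [1, 2, 3],
    "JOB_ID": [7, 8, 9],
})
df2 = pd.DataFrame({
    "ID_NUMBER": ["c", "a", "b"],
    "CHECKING_CODE": [3, 1, 5],
    "JOB_ID": [9, 7, 8],
})

# Use ID_NUMBER as the index, then put df2's rows in df1's order
left = df1.set_index("ID_NUMBER")
right = df2.set_index("ID_NUMBER").reindex(left.index)

# Element-wise comparison; IDs missing from df2 become NaN rows and compare False
cmp_by_id = left.eq(right).reset_index()
print(cmp_by_id)
```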

Bonus answer ("At the end, I want to add a column; if all the row values are "True" it should be True, and if there is one "False" it should be False"):

In [15]: df
Out[15]:
       a      b      c
0  False   True   True
1   True  False   True
2   True   True   True
3  False  False  False
4   True  False  False

In [16]: df['truth'] = df.all(axis=1)  # True only if every value in the row is True

In [17]: df
Out[17]:
       a      b      c  truth
0  False   True   True  False
1   True  False   True  False
2   True   True   True   True
3  False  False  False  False
4   True  False  False  False
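
Since the question also asks about DictReader, here is a rough sketch of the same comparison using only the standard library, including a final column that is True only when every compared value in the row matched. The function name compare_rows, the column name ALL_MATCH, and the inline sample data are all hypothetical. It assumes ID_NUMBER values are unique in the second file and skips IDs that appear only in the first file.

```python
import csv
import io

# Hypothetical inline data standing in for the two CSV files in the question
CSV1 = """ID_NUMBER,CHECKING_CODE,SECURITY_CODE,JOB_ID
1J2Wh,20,0,
124ggfh,100,0,O
"""
CSV2 = """ID_NUMBER,CHECKING_CODE,SECURITY_CODE,JOB_ID
1J2Wh,600,0,
124ggfh,100,0,O
"""


def compare_rows(file1, file2, key="ID_NUMBER"):
    """Compare rows of two CSV file objects that share the same `key` value."""
    # Index the second file by ID so each lookup is a dict access
    rows2 = {row[key]: row for row in csv.DictReader(file2)}
    results = []
    for row1 in csv.DictReader(file1):
        row2 = rows2.get(row1[key])
        if row2 is None:
            continue  # ID not present in the second file
        result = {key: row1[key]}
        for col, value in row1.items():
            if col != key:
                result[col] = value == row2.get(col)
        # True only if every compared column matched
        result["ALL_MATCH"] = all(v for c, v in result.items() if c != key)
        results.append(result)
    return results


results = compare_rows(io.StringIO(CSV1), io.StringIO(CSV2))
for row in results:
    print(row)
```

With real files, the two io.StringIO objects would be replaced by files opened with open(path, newline='').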