How can I compare results from two CSV files in Python with DictReader or Pandas? (Open to any other methods as well!)
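I am trying to compare the results of two CSV files. I want to check whether the first column (ID_NUMBER) in CSV1 matches CSV2, then check whether the values in the remaining columns also match, and output "True" or "False" for each corresponding column. At the end, I want to add a column: if all the row values are "True" it should be True, and if any of them is "False" it should be False.
Here is my code: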
import csv
import os
import sys
import difflib

def compare():
    # Python 3: open in text mode with newline='' (the original 'rb' mode only works on Python 2)
    list_csv1 = []
    with open('query1_results.csv', newline='') as query_results1:
        reader1 = csv.DictReader(query_results1)
        for row1 in reader1:
            list_csv1.append(row1)
    list_csv2 = []
    with open('query2_results.csv', newline='') as query_results2:
        reader2 = csv.DictReader(query_results2)
        for row2 in reader2:
            list_csv2.append(row2)
CSV1:
{'ID_NUMBER': '1J2Wh', 'CHECKING_CODE': '20', 'SECURITY_CODE': '0', 'JOB_ID': ''}
{'ID_NUMBER': '124ggfh', 'CHECKING_CODE': '100', 'SECURITY_CODE': '0', 'JOB_ID': 'O'}
CSV2:
{'ID_NUMBER': '1J2Wh', 'CHECKING_CODE': '600', 'SECURITY_CODE': '0', 'JOB_ID': ''}
{'ID_NUMBER': '124ggfh', 'CHECKING_CODE': '100', 'SECURITY_CODE': '0', 'JOB_ID': 'O'}
The output I need:
{'ID_NUMBER': '1J2Wh', 'CHECKING_CODE': FALSE, 'SECURITY_CODE': TRUE, 'JOB_ID': TRUE}
{'ID_NUMBER': '124ggfh', 'CHECKING_CODE': TRUE, 'SECURITY_CODE': TRUE, 'JOB_ID': TRUE}
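Since the question starts from csv.DictReader, here is a minimal sketch of how the comparison could be done with the standard library alone. The two files are replaced by in-memory strings holding the sample rows above, and the helper name compare_rows and the summary key ALL_MATCH are hypothetical names chosen for illustration. IDs that appear in only one file are simply skipped in this sketch.

```python
import csv
import io

# In-memory stand-ins for query1_results.csv and query2_results.csv (assumption)
CSV1_TEXT = """ID_NUMBER,CHECKING_CODE,SECURITY_CODE,JOB_ID
1J2Wh,20,0,
124ggfh,100,0,O
"""
CSV2_TEXT = """ID_NUMBER,CHECKING_CODE,SECURITY_CODE,JOB_ID
1J2Wh,600,0,
124ggfh,100,0,O
"""


def compare_rows(file1, file2, key="ID_NUMBER"):
    """Compare rows sharing the same key; return one dict of booleans per matched row."""
    # Index the second file by its key so row order does not matter
    rows2 = {row[key]: row for row in csv.DictReader(file2)}
    results = []
    for row1 in csv.DictReader(file1):
        row2 = rows2.get(row1[key])
        if row2 is None:
            continue  # ID only present in the first file: skipped here
        result = {key: row1[key]}
        for column, value in row1.items():
            if column != key:
                result[column] = (value == row2.get(column))
        # Hypothetical summary column: True only if every compared column matched
        result["ALL_MATCH"] = all(v for c, v in result.items() if c != key)
        results.append(result)
    return results


results = compare_rows(io.StringIO(CSV1_TEXT), io.StringIO(CSV2_TEXT))
for row in results:
    print(row)
```

With real files, the io.StringIO objects would be replaced by files opened with open(path, newline='').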
Assuming your CSV data looks like the following, or has been processed into this form:
Then the code below:
import pandas as pd

csv1 = pd.read_csv('data1.csv')
csv2 = pd.read_csv('data2.csv')
# inner merge keeps only the IDs present in both files
csv3 = pd.merge(csv1, csv2, on='ID_NUMBER', how='inner')
# csv3 will have columns like ['ID_NUMBER', 'CHECKING_CODE_x', 'SECURITY_CODE_x', 'JOB_ID_x', 'CHECKING_CODE_y', 'SECURITY_CODE_y', 'JOB_ID_y']
csv3['CHECKING_CODE'] = csv3.CHECKING_CODE_x == csv3.CHECKING_CODE_y
csv3['SECURITY_CODE'] = csv3.SECURITY_CODE_x == csv3.SECURITY_CODE_y
csv3['JOB_ID'] = csv3.JOB_ID_x == csv3.JOB_ID_y
csv_out = csv3[['ID_NUMBER', 'CHECKING_CODE', 'SECURITY_CODE', 'JOB_ID']]
This will give you the result in the following format:
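One caveat with the merge approach: by default read_csv turns empty cells (such as the blank JOB_ID in the question's sample) into NaN, and NaN == NaN evaluates to False, so two blank cells would be reported as a mismatch. The sketch below is one way around that, assuming it is acceptable to read every column as a string. The in-memory CSV text and the ALL_MATCH column name are assumptions made for illustration.

```python
import io

import pandas as pd

# In-memory stand-ins for data1.csv and data2.csv (assumption)
CSV1_TEXT = """ID_NUMBER,CHECKING_CODE,SECURITY_CODE,JOB_ID
1J2Wh,20,0,
124ggfh,100,0,O
"""
CSV2_TEXT = """ID_NUMBER,CHECKING_CODE,SECURITY_CODE,JOB_ID
1J2Wh,600,0,
124ggfh,100,0,O
"""

# dtype=str plus keep_default_na=False keeps blank cells as '' instead of NaN
csv1 = pd.read_csv(io.StringIO(CSV1_TEXT), dtype=str, keep_default_na=False)
csv2 = pd.read_csv(io.StringIO(CSV2_TEXT), dtype=str, keep_default_na=False)

merged = pd.merge(csv1, csv2, on="ID_NUMBER", how="inner", suffixes=("_x", "_y"))

# Compare every non-key column in a loop instead of naming each one by hand
value_cols = [c for c in csv1.columns if c != "ID_NUMBER"]
out = merged[["ID_NUMBER"]].copy()
for col in value_cols:
    out[col] = merged[col + "_x"] == merged[col + "_y"]

# Hypothetical summary column: True only when every compared column matched
out["ALL_MATCH"] = out[value_cols].all(axis=1)
print(out)
```

Looping over value_cols also means the code does not need to change when the files gain or lose columns.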
I would do it this way:
# compare all columns except first one and join result DF with the first column
cols = df1.columns
cmp = df1[['ID_NUMBER']].join(df1[cols[1:]] == df2[cols[1:]])
Step by step:
In [53]: %paste
import numpy as np
import pandas as pd
# generating sample data frame
cols = ['ID_NUMBER', 'CHECKING_CODE', 'SECURITY_CODE', 'JOB_ID']
df1 = pd.DataFrame(np.random.randint(0,100,size=(5, 4)), columns=cols)
df2 = df1.copy()
# changing some cells (.loc replaces the .ix indexer, which has been removed from pandas)
df2.loc[1, 'JOB_ID'] = 100
df2.loc[2, 'SECURITY_CODE'] = 100
df2.loc[3, 'CHECKING_CODE'] = 100
## -- End pasted text --
In [54]: df1
Out[54]:
   ID_NUMBER  CHECKING_CODE  SECURITY_CODE  JOB_ID
0         58             47             62      72
1         75             21             67      99
2         49             70             92      30
3         80             85             95      78
4         64             82             21       2
In [55]: df2
Out[55]:
   ID_NUMBER  CHECKING_CODE  SECURITY_CODE  JOB_ID
0         58             47             62      72
1         75             21             67     100
2         49             70            100      30
3         80            100             95      78
4         64             82             21       2
In [56]: %paste
# comparison
cols = df1.columns
cmp = df1[['ID_NUMBER']].join(df1[cols[1:]] == df2[cols[1:]])
## -- End pasted text --
In [57]: cmp
Out[57]:
   ID_NUMBER  CHECKING_CODE  SECURITY_CODE  JOB_ID
0         58           True           True    True
1         75           True           True   False
2         49           True          False    True
3         80          False           True    True
4         64           True           True    True
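The comparison above is positional: row 0 of df1 is compared with row 0 of df2, which only gives the right answer if both frames list the IDs in the same order. A minimal sketch of one way to align the rows by ID_NUMBER first, using the question's sample values with the second frame deliberately shuffled (the name aligned_cmp is a hypothetical name for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID_NUMBER": ["1J2Wh", "124ggfh"],
    "CHECKING_CODE": ["20", "100"],
    "SECURITY_CODE": ["0", "0"],
    "JOB_ID": ["", "O"],
})
# Same IDs as df1, but listed in a different order
df2 = pd.DataFrame({
    "ID_NUMBER": ["124ggfh", "1J2Wh"],
    "CHECKING_CODE": ["100", "600"],
    "SECURITY_CODE": ["0", "0"],
    "JOB_ID": ["O", ""],
})

a = df1.set_index("ID_NUMBER")
# reindex puts df2's rows in df1's ID order; IDs missing from df2 become NaN rows
b = df2.set_index("ID_NUMBER").reindex(a.index)

# Element-wise equality on the aligned frames, then restore ID_NUMBER as a column
aligned_cmp = a.eq(b).reset_index()
print(aligned_cmp)
```

An ID that exists only in df1 ends up compared against NaN and is therefore reported as False in every column.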
Bonus answer ("At the end, I want to add a column; if all the row values are "True" it should be True, and if any of them is "False" it should be False"):
In [15]: df
Out[15]:
       a      b      c
0  False   True   True
1   True  False   True
2   True   True   True
3  False  False  False
4   True  False  False
In [16]: df['truth'] = df.all(axis=1)  # True only if every value in the row is True
In [17]: df
Out[17]:
       a      b      c  truth
0  False   True   True  False
1   True  False   True  False
2   True   True   True   True
3  False  False  False  False
4   True  False  False  False
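Whether the summary column means "every column matched" or "at least one column matched" depends on the reduction that is applied across the row. A minimal sketch on the same sample frame, where the column names all_true and any_true are hypothetical names used only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [False, True, True, False, True],
    "b": [True, False, True, False, False],
    "c": [True, True, True, False, False],
})

# Take the boolean columns once so the new columns do not feed into each other
flags = df[["a", "b", "c"]]

# True only when every value in the row is True (what the question asks for)
df["all_true"] = flags.all(axis=1)

# True when at least one value in the row is True (same as flags.sum(axis=1) > 0)
df["any_true"] = flags.any(axis=1)

print(df)
```

For the requirement in the question, the all(axis=1) form is the one that yields False as soon as a single column differs.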