Returning differences between two columns in two different files in Excel using Python
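I have two csv files that share a column called 'Name'. File 2 is updated continuously, and new values are added at random positions in the column. How can I write a script that compares the two columns and finds the differences, regardless of where the new values sit in file2?
Other solutions only find the differences when the new values are at the end of the column, not when they appear at random positions within it.
The code I tried (it only outputs new values at the bottom of the column, not ones that appear at random positions in it):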
import pandas as pd

df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
new_df = (df1[['Name']].merge(df2[['Name']], on='Name', how='outer', indicator=True)
          .query("_merge != 'both'")
          .drop('_merge', axis=1))
new_df.to_csv('file4.csv')
File 1:
Name
gfd454
3v4fd
th678iy
File 2:
Name
gfd454
fght45
3v4fd
th678iy
The output should be:
Name
fght45
Do a left join with file 2 on the left. After that, extract the rows that did not match (the ones that come back as NaN).
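The left join suggested above could be sketched roughly as follows. This is only an illustration on the sample data from the question. Because the frames here carry nothing but the key column, there is no extra column to turn NaN, so the sketch assumes pandas' `indicator=True` flag is an acceptable way to spot the unmatched rows. The names `merged` and `new_names` are made up for the example.

```python
import pandas as pd

# Sample data copied from the question; in practice these would come
# from pd.read_csv('file1.csv') and pd.read_csv('file2.csv').
df1 = pd.DataFrame({'Name': ['gfd454', '3v4fd', 'th678iy']})
df2 = pd.DataFrame({'Name': ['gfd454', 'fght45', '3v4fd', 'th678iy']})

# Left join with file 2 on the left. The indicator column '_merge'
# records whether each row of df2 found a match in df1.
# df1 is de-duplicated first so the join cannot multiply rows.
merged = df2[['Name']].merge(
    df1[['Name']].drop_duplicates(),
    on='Name',
    how='left',
    indicator=True,
)

# Rows marked 'left_only' exist in file 2 but not in file 1.
new_names = merged.loc[merged['_merge'] == 'left_only', ['Name']]
new_names = new_names.reset_index(drop=True)
print(new_names)
```

The position of a new value inside file 2 does not matter here, since the join matches on the value of 'Name' rather than on row order.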
If you only want to check a single column, you can try comparing the two lists:
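list1 = df1['Name'].tolist()
list2 = df2['Name'].tolist()
# a set makes each membership test O(1) on average
s = set(list1)
# values of file 2 that are not in file 1, in file 2's order
diff = [x for x in list2 if x not in s]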
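If the result should stay a DataFrame rather than a plain list, the same membership test can be written with `Series.isin`. The following is a minimal sketch on the question's sample data, not necessarily what the answerer had in mind. `mask` and `diff_df` are illustrative names, and the commented-out output line reuses the `file4.csv` name from the question.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({'Name': ['gfd454', '3v4fd', 'th678iy']})
df2 = pd.DataFrame({'Name': ['gfd454', 'fght45', '3v4fd', 'th678iy']})

# True for every row of df2 whose Name does not occur anywhere in df1.
mask = ~df2['Name'].isin(df1['Name'])

# Keep only the new names, dropping repeats and renumbering the rows.
diff_df = df2.loc[mask, ['Name']].drop_duplicates().reset_index(drop=True)
print(diff_df)

# To write the result without the index column:
# diff_df.to_csv('file4.csv', index=False)
```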
import pandas as pd

# df1: original dataframe holding the File_1 data
df1 = pd.DataFrame({'Name': ['gfd454', '3v4fd', 'th678iy']})
# df2: dataframe holding the changing File_2 data
df2 = pd.DataFrame({'Name': ['gfd454', 'abcde', 'fght45', '3v4fd', 'abcde', 'th678iy', 'abcde']})
# Assuming df1 comprises distinct elements and doesn't change, and that
# df2 contains all elements of df1 and more (the new updates).
# df2 may have duplicates like 'abcde'.
# Drop duplicates in df2; if df1 has duplicates too, drop them first.
# keep='first': drop duplicates except for the first occurrence.
df2 = df2.drop_duplicates(keep='first')
print(df2)
# pandas.concat appends the rows of df2 to df1, even those already in df1
df_concat = pd.concat([df1, df2], join='outer', ignore_index=True)
print(df_concat)
# keep=False drops every value that occurs more than once,
# i.e. every name that is present in both df1 and df2
df_diff = df_concat.drop_duplicates(keep=False)
print(df_diff)
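Now, the catch is that you must make sure df1 - df2 = {}, i.e. df1 is a subset of df2. Otherwise any name that exists only in df1 also survives drop_duplicates(keep=False) and ends up in the result.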
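One way to verify that subset condition before relying on the concat approach is a plain set comparison. This is a sketch under the assumption that comparing the raw 'Name' strings is sufficient (no whitespace or case normalisation); `names1`, `names2`, `is_subset` and `missing_from_df2` are names invented for the example.

```python
import pandas as pd

# Same sample frames as in the answer above.
df1 = pd.DataFrame({'Name': ['gfd454', '3v4fd', 'th678iy']})
df2 = pd.DataFrame({'Name': ['gfd454', 'abcde', 'fght45', '3v4fd',
                             'abcde', 'th678iy', 'abcde']})

names1 = set(df1['Name'])
names2 = set(df2['Name'])

# df1 is a subset of df2 exactly when the set difference names1 - names2 is empty.
is_subset = names1 <= names2
missing_from_df2 = sorted(names1 - names2)

if is_subset:
    print('df1 is a subset of df2; the concat approach is safe to use')
else:
    print('names only in df1:', missing_from_df2)
```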