Comparing two CSV files and exporting the differences and similarities in Python?
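Hi everyone, I'm having trouble getting a task done with two separate CSV files. I've found scripts scattered around the web that do part of what I need, but not all of it. I no longer have the code I tried, because I've deleted so many different attempts that I've been staring at a blank .py file for quite a while. First, the CSV files.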
netscan.csv (contains computer names and serials, the data is correct, has models)
Name Serial Models
computer1 serial1 model1
computer2 serial2 model2
computer3 serial3 model3
computer4 serial4 model4
... ...
computer_list.csv (contains computer names and serials, names are correct, includes names not in netscan.csv, no models, serials are wrong)
Name Serial Models
computer1 serialZ
computerH serialN/A
computer3 serialQ
computer4 serialX
computer2 serialM
computerP serialN/A
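So what I want to do is look at both files and, whenever the values in the Name column match, write the row from netscan.csv to a new file, doing this for every row. After that, I want to take everything that has no match (for example computerH, which isn't in netscan.csv) and add those rows to the new CSV below the updated, correct information. Like this: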
Name Serial Models
computer1 serial1 model1
computer2 serial2 model2
computer3 serial3 model3
computer4 serial4 model4
computerH serialN/A
computerP serialN/A
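I've tried merges, for loops, writing rows and so on, but at this point I'm lost as to how to get this done. Any help would be greatly appreciated.
Edit: @unutbu what I get from your code is essentially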
Name Serial Models
computer1 serial1 model1
computer2 serial2 model2
computer3 serial3 model3
computer4 serial4 model4
computerH serialN/A
computerP serialN/A
computer2 serialN/A
computer3 serialN/A
computer4 serialN/A
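So although nearly everything is correct, there are still duplicate Name rows left over from computer_list.csv, and those need to be removed once they have been replaced by the correct information. So I'd like to find rows with duplicate names and delete the ones whose serial is serialN/A. Hope this makes more sense.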
This may help with various kinds of comparison:
import numpy as np
import pandas as pd

# file_name = "list.xlsx"
# df = pd.read_excel(file_name, sheet_name=0)
df = pd.DataFrame({'List1': [1, 2, 3, 4, 5, 5, 11, 4], 'List 2': [3, 5, 6, 8, 9, 3, 4, 9]}, columns=['List1', 'List 2'])
print(df)
# df.to_excel("list1.xlsx", header=True, index=False)
df['Intersect'] = pd.Series(np.intersect1d(df['List1'], df['List 2']))  # unique values common to both
df['commonin1'] = df['List1'][np.isin(df['List1'], df['List 2'])]  # non-unique items of List1 also in List 2
df['commonin2'] = df['List 2'][np.isin(df['List 2'], df['List1'])]  # non-unique items of List 2 also in List1
df['1not2'] = pd.Series(np.setdiff1d(df['List1'], df['List 2']))  # unique values in List1 but not in List 2
df['2not1'] = pd.Series(np.setdiff1d(df['List 2'], df['List1']))  # unique values in List 2 but not in List1
df['1not2NU'] = df['List1'][~np.isin(df['List1'], df['List 2'])]  # in List1 but not in List 2, non-unique
df['2not1NU'] = df['List 2'][~np.isin(df['List 2'], df['List1'])]  # in List 2 but not in List1, non-unique
df['exclusive'] = pd.Series(np.setxor1d(df['List1'], df['List 2']))  # in exactly one of the two lists
# The union can be longer than the frame, so concatenate it as its own column
df = pd.concat([df, pd.DataFrame(np.union1d(df['List1'], df['List 2']), columns=['Union'])], axis=1)  # all unique values
print(df)
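Applied to the question's data, the same NumPy set functions could be run on the Name columns directly. This is only a sketch: the two arrays below are hypothetical stand-ins for the Name columns of netscan.csv and computer_list.csv, typed in by hand from the sample data in the question.

```python
import numpy as np

# Hypothetical stand-ins for the Name columns of the two CSV files
netscan_names = np.array(["computer1", "computer2", "computer3", "computer4"])
list_names = np.array(["computer1", "computerH", "computer3",
                       "computer4", "computer2", "computerP"])

# Names present in both files (sorted, unique)
in_both = np.intersect1d(netscan_names, list_names)

# Names present only in computer_list.csv (sorted, unique)
only_in_list = np.setdiff1d(list_names, netscan_names)

print(in_both)
print(only_in_list)
```

Both functions return sorted unique values, so the original row order is lost; that is fine for deciding which names match but not for writing rows out in file order.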
Take a look at this:
import pandas as pd

# Read both files; column names come from the first row of each CSV
netscan = pd.read_csv('netscan.csv', header=0)
computer_list = pd.read_csv('computer_list.csv', header=0)
# An inner merge keeps only the rows whose Name is found in both DataFrames;
# overlapping columns get a suffix, e.g. Serial_netscan and Serial_computer_list
computer_match = netscan.merge(right=computer_list, how='inner', on='Name', suffixes=('_netscan', '_computer_list'))
# Get the list of computer names that matched
match_list = computer_match.Name.unique().tolist()
# Get the rows of computers that did NOT match (note the ~ negation)
computer_no_match = computer_list.loc[~computer_list.Name.isin(match_list), :]
# Finally, save everything to CSV
computer_match.to_csv('computer_match.csv', index=False)
computer_no_match.to_csv('computer_no_match.csv', index=False)
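As an alternative to building the list of matched names by hand, pandas' merge accepts indicator=True, which adds a _merge column saying where each row was found. The sketch below is not part of the answer above; it assumes the sample data from the question, inlined through io.StringIO as comma-separated text so that it runs without the real files.

```python
import io
import pandas as pd

# Hypothetical inline stand-ins for netscan.csv and computer_list.csv
netscan = pd.read_csv(io.StringIO(
    "Name,Serial,Models\n"
    "computer1,serial1,model1\n"
    "computer2,serial2,model2\n"
    "computer3,serial3,model3\n"
    "computer4,serial4,model4\n"
), dtype=str)
computer_list = pd.read_csv(io.StringIO(
    "Name,Serial,Models\n"
    "computer1,serialZ,\n"
    "computerH,serialN/A,\n"
    "computer3,serialQ,\n"
    "computer4,serialX,\n"
    "computer2,serialM,\n"
    "computerP,serialN/A,\n"
), dtype=str)

# Left-merge against netscan's Name column only; _merge is "both" when the
# name also exists in netscan and "left_only" when it does not
flagged = computer_list.merge(netscan[["Name"]], on="Name",
                              how="left", indicator=True)

matched_names = flagged.loc[flagged["_merge"] == "both", "Name"].tolist()
unmatched_names = flagged.loc[flagged["_merge"] == "left_only", "Name"].tolist()

print(matched_names)
print(unmatched_names)
```

If netscan could contain the same Name more than once, the left merge would repeat rows, so deduplicating netscan's names first would be a sensible precaution.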
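You can merge the netscan and computer DataFrames, then fill the missing values in the Serial column with SerialN/A.
import pandas as pd

netscan = pd.read_csv('netscan.csv')
# Only the Name column of computer_list.csv is needed; its serials are wrong anyway
computer = pd.read_csv('computer_list.csv', usecols=['Name'])
# Strip trailing whitespace so that names match exactly
for df in [netscan, computer]:
    df['Name'] = df['Name'].str.rstrip()
# An outer merge keeps the names from both files
result = pd.merge(netscan, computer, on='Name', how='outer')
result['Serial'] = result['Serial'].fillna('SerialN/A')
result.to_csv('result.csv', index=False)
print(result)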
This produces a CSV file (result.csv) containing:
Name,Serial,Models
computer1,serial1,model1
computer2,serial2,model2
computer3,serial3,model3
computer4,serial4,model4
computerH,SerialN/A,
computerP,SerialN/A,
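For the follow-up in the question's edit (duplicate names left over from computer_list.csv), one possible approach, different from the merge above, is to stack the two frames with netscan first and keep only the first row for each Name. The sketch below assumes the sample data from the question, inlined as comma-separated text, and assumes that netscan's row should always win when a name appears in both files.

```python
import io
import pandas as pd

# Hypothetical inline stand-ins for netscan.csv and computer_list.csv
netscan = pd.read_csv(io.StringIO(
    "Name,Serial,Models\n"
    "computer1,serial1,model1\n"
    "computer2,serial2,model2\n"
    "computer3,serial3,model3\n"
    "computer4,serial4,model4\n"
), dtype=str)
computer_list = pd.read_csv(io.StringIO(
    "Name,Serial,Models\n"
    "computer1,serialZ,\n"
    "computerH,serialN/A,\n"
    "computer3,serialQ,\n"
    "computer4,serialX,\n"
    "computer2,serialM,\n"
    "computerP,serialN/A,\n"
), dtype=str)

# Strip stray whitespace so that "computer2 " and "computer2" count as the same name
for frame in (netscan, computer_list):
    frame["Name"] = frame["Name"].str.strip()

# netscan rows come first, so keep="first" prefers them over computer_list rows
combined = pd.concat([netscan, computer_list], ignore_index=True)
result = combined.drop_duplicates(subset="Name", keep="first").reset_index(drop=True)

print(result)
```

Names that exist only in computer_list.csv keep whatever serial that file lists for them, which in the sample data is always serialN/A.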