Compare 2 CSV files and remove the common lines from the 1st file | python
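I want to compare 2 CSV files, master.csv and exclude.csv, remove every row of master.csv whose first column matches an entry in exclude.csv, and write the final output back to master.csv.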
master.csv
abc,xyz
cde,fgh
ijk,lmn
exclude.csv
###Exclude list###
cde
####
Expected output (it should overwrite master.csv):
abc,xyz
ijk,lmn
What I have tried so far:
with open('exclude.csv','r') as in_file, open('master.csv','w') as out_file:
    seen = set()
    for line in in_file:
        if line in seen: continue # skip duplicate
        seen.add(line)
        out_file.write(line)
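I'm sure there is a pandas or other module approach, but here is a pure Python one:
# Read both files fully before master.csv is reopened for writing
with open("master.csv") as f:
    master = f.read()
with open("exclude.csv") as f:
    exclude = f.read()
master = master.strip().split("\n")
exclude = exclude.strip().split("\n")
returnList = []
for line in master:
    check = True
    for exc in exclude:
        # Compare against the first column only, not the whole line
        if exc == line.split(",")[0]:
            check = False
            break
    if check:
        returnList.append(line)
# Overwrite master.csv with the rows that survived
with open("master.csv", "w") as f:
    f.write("\n".join(returnList))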
Output of master.csv:
abc,xyz
ijk,lmn
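The same idea can be sketched with the standard csv module and a set for the lookups. This is only a minimal sketch: the function name remove_excluded_rows is made up for illustration, and it assumes that lines in exclude.csv starting with '#' (such as "###Exclude list###") are decoration rather than data.

```python
import csv


def remove_excluded_rows(master_path, exclude_path):
    """Drop rows from master_path whose first column appears in exclude_path."""
    # Collect exclusion keys, skipping blank lines and '#' marker lines
    # (assumption: lines starting with '#' are decoration, not data).
    with open(exclude_path, newline="") as f:
        excluded = {
            line.strip()
            for line in f
            if line.strip() and not line.startswith("#")
        }

    # Read every master row before the file is reopened for writing.
    with open(master_path, newline="") as f:
        rows = [row for row in csv.reader(f) if row]

    # Keep rows whose first column is not an exclusion key.
    kept = [row for row in rows if row[0] not in excluded]

    # Overwrite the master file with the surviving rows.
    with open(master_path, "w", newline="") as f:
        csv.writer(f, lineterminator="\n").writerows(kept)
    return kept
```

Calling remove_excluded_rows("master.csv", "exclude.csv") would rewrite master.csv in place. The set makes each membership check constant time, which matters once the exclude list grows.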
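The simplest way is to use pandas. For example, reading a CSV file and writing it back out as an Excel file:
import pandas as pd
# Reading the csv file
df_new = pd.read_csv('Names.csv')
# Saving as an xlsx file; the context manager saves and closes the writer
# (ExcelWriter.save() is no longer available in recent pandas versions)
with pd.ExcelWriter('Names.xlsx') as GFG:
    df_new.to_excel(GFG, index=False)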
A pure Python answer using a list comprehension:
with open('master.csv', 'r') as f:
    keep_lines = f.readlines()
with open('exclude.csv', 'r') as f:
    # A set of exclusion keys, so lookups do not depend on line position
    drop_keys = {line.strip() for line in f}
# Keep each master line whose first column is not an exclusion key
write_lines = [line for line in keep_lines if line.strip().split(',')[0] not in drop_keys]
with open('master.csv', 'w') as f:
    f.writelines(write_lines)
You can use pandas like this:
import pandas as pd
master_df = pd.read_csv('master.csv')
exclude_df = pd.read_csv('exclude.csv')
conc = pd.concat([master_df, exclude_df])  # concatenate the two dataframes
conc.drop_duplicates(subset=['col1'], inplace=True, keep=False)
print(conc)
drop_duplicates with subset=['col1'] checks for duplicates in col1 only.
keep accepts 3 values: 'first', 'last' and False.
I chose keep=False so that none of the duplicated rows are kept.
Dataset:
master.csv:
col1,col2
abc,xyz
cde,fgh
ijk,lmn
exclude.csv:
col1
cde
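One caveat with the concat and drop_duplicates approach: an exclude value that never appears in master.csv stays in the result as a row with an empty col2, and col1 values repeated inside master.csv are dropped too. A possible alternative, sketched here under the assumption that both files have a col1 header as in the dataset above, is to filter with Series.isin. The helper name drop_excluded is hypothetical.

```python
import pandas as pd


def drop_excluded(master_df, exclude_df, key="col1"):
    """Return the rows of master_df whose key value is absent from exclude_df."""
    # isin builds a boolean mask of matching rows; ~ inverts it so they are dropped.
    mask = master_df[key].isin(exclude_df[key])
    return master_df[~mask]
```

To overwrite the file, the result could then be written back with drop_excluded(master_df, exclude_df).to_csv('master.csv', index=False).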