How to remove duplicate/repeated rows in a CSV with Python?
I am scraping the web with Python and writing the data to a .csv file like the one below. Since I append to the file, I may end up with some repeated/duplicate rows. What can I use to avoid this? I'm not sure about pandas, i.e. whether I should open the file in pandas and then drop the duplicates. I tried a few other approaches of my own but couldn't come up with a solution. I'm considering pandas as a last resort.
Date,Time,Status,School,GPA,GRE,GMAT,Round,Location,Post-MBA Career,via,on,Details,Note
2021-05-18,13:59:00,Accepted from Waitlist,Yale SOM,3.8,No data provided,740,Round 2 ,NYC,Non Profit / Social Impact,phone,2021-05-18,GPA: 3.8 GMAT: 740 Round: Round 2 | NYC,Interviewed and was waitlisted in R2. Just received the call this afternoon. Good luck everyone!
2021-05-18,13:51:00,Accepted from Waitlist,Yale SOM,3.8,323,No data provided,Round 2 ,Austin,Marketing,phone,2021-05-18,GPA: 3.8 GRE: 323 Round: Round 2 | Austin,Keep your head up! It all works out how it is supposed to.
Maybe read the lines one at a time, store them in a set (so there are no duplicates), and then write them back?
lines = set()
file = 'foo.txt'
with open(file) as fd:
    for line in fd:
        lines.add(line)
# note: a set does not preserve order, so the header row may not stay first
with open(file, 'w') as fd:
    fd.write(''.join(lines))
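Since a set discards row order (so the CSV header can end up anywhere), a variant of the same idea is to keep the lines in a list and use the set only for membership checks. This is a sketch; the filename `foo.txt` is carried over from above:

```python
def dedupe_lines(path):
    """Remove duplicate lines from a text file,
    keeping the first occurrence of each line in its original position."""
    seen = set()
    unique = []
    with open(path) as fd:
        for line in fd:
            if line not in seen:   # only keep lines we haven't seen yet
                seen.add(line)
                unique.append(line)
    with open(path, 'w') as fd:
        fd.writelines(unique)
```

This way the header row stays first and the data rows keep their original order.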
If you want to use pandas:
import pandas as pd

# 1. Read the CSV
df = pd.read_csv("data.csv")
# 2(a). Drop complete-row duplicates
df.drop_duplicates(inplace=True)
# 2(b). Or drop duplicates based on a subset of columns
df.drop_duplicates(subset=['Date', 'Time', <other_fields>], inplace=True)
# 3. Save the result
df.to_csv("data.csv", index=False)
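Since the scraper appends to the file, the deduplication can also happen at append time. A sketch of that pattern (the function name and key columns are assumptions, not part of the original answer):

```python
import pandas as pd

def append_without_duplicates(path, new_rows, key_cols=None):
    """Append new_rows (a DataFrame) to the CSV at path, then drop duplicates.

    key_cols: columns that identify a record (e.g. ['Date', 'Time']);
    None means two rows are duplicates only if every column matches.
    """
    try:
        existing = pd.read_csv(path)
        combined = pd.concat([existing, new_rows], ignore_index=True)
    except FileNotFoundError:
        combined = new_rows  # first run: nothing to merge with
    combined = combined.drop_duplicates(subset=key_cols, keep='first')
    combined.to_csv(path, index=False)
```

With this approach the file on disk never holds duplicates, rather than being cleaned after the fact.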