How to remove duplicate/repeated rows in a CSV with Python?
I am scraping the web with Python and writing the data to a .csv file like the one below. Since I append to the file, I may end up with some repeated/duplicate rows. What can I use to avoid this? I'm not sure about pandas, i.e. whether I should open the file in pandas and then drop the duplicates. I tried a few other approaches of my own but couldn't come up with a solution. I'm considering pandas as a last resort.
Date,Time,Status,School,GPA,GRE,GMAT,Round,Location,Post-MBA Career,via,on,Details,Note
2021-05-18,13:59:00,Accepted from Waitlist,Yale SOM,3.8,No data provided,740,Round 2 ,NYC,Non Profit / Social Impact,phone,2021-05-18,GPA: 3.8 GMAT: 740 Round: Round 2 | NYC,Interviewed and was waitlisted in R2. Just received the call this afternoon. Good luck everyone!
2021-05-18,13:51:00,Accepted from Waitlist,Yale SOM,3.8,323,No data provided,Round 2 ,Austin,Marketing,phone,2021-05-18,GPA: 3.8 GRE: 323 Round: Round 2 | Austin,Keep your head up! It all works out how it is supposed to.
Maybe read the lines one at a time, store them in a set (so there are no duplicates), and then write them back?
lines = set()
file = 'foo.txt'
with open(file) as fd:
    for line in fd:
        lines.add(line)
# note: a set does not preserve order, so the header row may not stay first
with open(file, 'w') as fd:
    fd.write(''.join(lines))
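Since a set discards row order (so the CSV header can end up anywhere), a variant of the same idea is to keep the lines in a list and use the set only for membership checks. This is a sketch; the filename `foo.txt` is carried over from above:

```python
def dedupe_lines(path):
    """Remove duplicate lines from a text file,
    keeping the first occurrence of each line in its original position."""
    seen = set()
    unique = []
    with open(path) as fd:
        for line in fd:
            if line not in seen:   # only keep lines we haven't seen yet
                seen.add(line)
                unique.append(line)
    with open(path, 'w') as fd:
        fd.writelines(unique)
```

This way the header row stays first and the data rows keep their original order.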
If you want to use pandas:
import pandas as pd

# 1. Read the CSV
df = pd.read_csv("data.csv")
# 2(a). Drop complete-row duplicates
df.drop_duplicates(inplace=True)
# 2(b). Or drop duplicates based on a subset of columns
df.drop_duplicates(subset=['Date', 'Time', <other_fields>], inplace=True)
# 3. Save the result
df.to_csv("data.csv", index=False)
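Since the scraper appends to the file, the deduplication can also happen at append time. A sketch of that pattern (the function name and key columns are assumptions, not part of the original answer):

```python
import pandas as pd

def append_without_duplicates(path, new_rows, key_cols=None):
    """Append new_rows (a DataFrame) to the CSV at path, then drop duplicates.

    key_cols: columns that identify a record (e.g. ['Date', 'Time']);
    None means two rows are duplicates only if every column matches.
    """
    try:
        existing = pd.read_csv(path)
        combined = pd.concat([existing, new_rows], ignore_index=True)
    except FileNotFoundError:
        combined = new_rows  # first run: nothing to merge with
    combined = combined.drop_duplicates(subset=key_cols, keep='first')
    combined.to_csv(path, index=False)
```

With this approach the file on disk never holds duplicates, rather than being cleaned after the fact.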