How to compare two text files in Python and remove duplicates?

Question

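I am new to Python. I have two text files, each containing a list of URLs. I want to compare the text1 file against the text2 file and remove from text1 every URL that also appears in text2.

My text files look like this: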

text2

https://www.basketbal.vlaanderen/clubs/detail/bbc-wervik
https://www.basketbal.vlaanderen/clubs/detail/bbc-alsemberg
https://www.basketbal.vlaanderen/clubs/detail/koninklijk-basket-team-ion-waregem
https://www.basketbal.vlaanderen/clubs/detail/basket-poperinge

text1

https://www.basketbal.vlaanderen/clubs/detail/bbc-erembodegem
https://www.basketbal.vlaanderen/clubs/detail/dbc-osiris-okapi-aalst
https://www.basketbal.vlaanderen/clubs/detail/the-tower-aalst
https://www.basketbal.vlaanderen/clubs/detail/gsg-aarschot
https://www.basketbal.vlaanderen/clubs/detail/bbc-wervik #duplicate url from text2
https://www.basketbal.vlaanderen/clubs/detail/bbc-alsemberg #duplicate url from text2

After searching on Google I found a few solutions, but they only remove duplicates within a single file.

pandas solution for removing duplicates

df.drop_duplicates(subset ="link", keep ='first', inplace = True)

Python regular expression

import re
re.sub('<.*?>', '', string)  # this does not remove duplicates, it only replaces the matched text with an empty string ('')

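I have not found a good way to compare two text files in Python and remove the duplicates. If any URL in the text1 file matches a URL in the text2 file, the matching URL should be removed from text1. Any idea how to do this in Python?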

Answer 1

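If the order of the lines in the file does not matter, you can do this:

with open("file1.txt") as f1:
    # splitlines() drops the newline characters, so a last line
    # without a trailing newline still compares equal
    set1 = set(f1.read().splitlines())
with open("file2.txt") as f2:
    set2 = set(f2.read().splitlines())

# lines that are in file1 but not in file2
nondups = set1 - set2

with open("file1.txt", "w") as out:
    out.writelines(line + "\n" for line in nondups)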

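This turns the contents of each file into a set of lines, removes from the first set every line that also appears in the second, and writes the result back to the first file. Because sets are unordered, the original line order of the first file is not preserved.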
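If the order of the URLs in the first file does matter, one possible variation is to keep only the second file in a set and filter the first file line by line. This is a minimal sketch, not part of the original answer: the function name `remove_matching_lines` and its parameters are hypothetical, and it assumes each line holds exactly one URL with no trailing annotation.

```python
from pathlib import Path


def remove_matching_lines(target_path, reference_path):
    """Rewrite target_path without the lines that also occur in reference_path.

    Hypothetical helper for illustration; assumes one URL per line.
    """
    # Strip surrounding whitespace so "url\n" and "url" compare as equal,
    # and ignore blank lines in the reference file.
    reference = {
        line.strip()
        for line in Path(reference_path).read_text().splitlines()
        if line.strip()
    }
    # A list comprehension keeps the original order of the target file.
    kept = [
        line
        for line in Path(target_path).read_text().splitlines()
        if line.strip() not in reference
    ]
    Path(target_path).write_text("".join(line + "\n" for line in kept))
    return kept
```

It could be called as `remove_matching_lines("file1.txt", "file2.txt")`. The set lookup keeps each membership test cheap, while the list keeps the remaining lines in the order they had in the first file.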

python

python-re