Extracting unique values from multiple rows in a csv file

Question

I have a csv file of real estate data. The rows contain a lot of repeated information, as in the example below:

Row1:
Su baldais, Skalbimo mašina, **Viryklė**, **Indaplovė**, Vonia
Row2:
Virtuvės komplektas, **Viryklė**, **Indaplovė**, Dušo kabina, Rekuperacinė sistema

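As you can see, a lot of the data is repeated (I marked the repeats with asterisks). Is there a way in Python to get only the unique values from all of the rows?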

Answer 1

It's not entirely clear what you want, so I'll cover two scenarios:

Your data, saved as example.csv in the cwd:

Su baldais,Skalbimo mašina,Viryklė,Indaplovė,Vonia
Virtuvės komplektas,Viryklė,Indaplovė,Dušo kabina,Rekuperacinė sistema
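
One caveat: the rows in the question are written with a space after each comma, while the file above has none. If the real file does contain those spaces, every value after the first in a row keeps a leading space, so "Su baldais" in the first column would not match " Su baldais" elsewhere. A minimal sketch of one way to normalise this, using the csv module's skipinitialspace option plus strip(); the read_rows helper and the in-memory SAMPLE are only there for illustration:

```python
import csv
import io

# Sample rows written the way the question shows them,
# with a space after each comma (stands in for the real file).
SAMPLE = (
    "Su baldais, Skalbimo mašina, Viryklė, Indaplovė, Vonia\n"
    "Virtuvės komplektas, Viryklė, Indaplovė, Dušo kabina, Rekuperacinė sistema\n"
)


def read_rows(handle):
    """Return each row as a list of values without surrounding whitespace."""
    # skipinitialspace drops whitespace right after each delimiter;
    # strip() also removes any trailing whitespace.
    reader = csv.reader(handle, skipinitialspace=True)
    return [[value.strip() for value in row] for row in reader]


rows = read_rows(io.StringIO(SAMPLE))
print(rows[0])
```

With a real file, you would pass the open file handle instead of io.StringIO(SAMPLE).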

Scenario 1

You want every value that appears in the csv, but do not want any value more than once. This is a perfect use case for a set, which stores each value only once.

#!/usr/bin/env python3
import csv

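unique_values = set()

# newline="" is what the csv docs recommend; utf-8 is an assumption about the file
with open("example.csv", newline="", encoding="utf-8") as handle:
    reader = csv.reader(handle)
    for row in reader:
        # update() adds every value in the row; values already present are ignored
        unique_values.update(row)

print(", ".join(unique_values))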

Result (a set is unordered, so the order may differ from run to run):

Skalbimo mašina, Dušo kabina, Rekuperacinė sistema, Su baldais, Indaplovė, Virtuvės komplektas, Viryklė, Vonia
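
If the order of first appearance matters, a dict can stand in for the set, since dicts keep insertion order in Python 3.7+. This is only a sketch of that variation, not a change to the approach above; the unique_in_order helper and the in-memory SAMPLE are made up for illustration:

```python
import csv
import io

# Same two rows as example.csv, kept in memory so the sketch runs on its own.
SAMPLE = (
    "Su baldais,Skalbimo mašina,Viryklė,Indaplovė,Vonia\n"
    "Virtuvės komplektas,Viryklė,Indaplovė,Dušo kabina,Rekuperacinė sistema\n"
)


def unique_in_order(handle):
    """Return each distinct value once, in order of first appearance."""
    seen = {}
    for row in csv.reader(handle):
        # Keys already in the dict keep their original position;
        # new keys are appended at the end.
        seen.update(dict.fromkeys(row))
    return list(seen)


ordered = unique_in_order(io.StringIO(SAMPLE))
print(", ".join(ordered))
```

For the sample data this prints the values starting with Su baldais and ending with Rekuperacinė sistema, with Viryklė and Indaplovė listed once each.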

Scenario 2

You want only the unique values from the csv, discarding any values that appear more than once.

#!/usr/bin/env python3
import csv

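all_values = set()   # every value seen so far
to_delete = set()    # values seen more than once

# newline="" is what the csv docs recommend; utf-8 is an assumption about the file
with open("example.csv", newline="", encoding="utf-8") as handle:
    reader = csv.reader(handle)
    for row in reader:
        for value in row:
            if value in all_values:
                # second or later sighting: mark the value for removal
                to_delete.add(value)
            else:
                all_values.add(value)

# set difference keeps only the values that were never marked
print(", ".join(all_values - to_delete))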

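Here I use two sets. The second one, to_delete, holds every value we have seen more than once. Computing all_values - to_delete then gives me the set of values that appear exactly once.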

Result:

Dušo kabina, Su baldais, Virtuvės komplektas, Skalbimo mašina, Vonia, Rekuperacinė sistema
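
The same result can also be reached by counting occurrences with collections.Counter and keeping the values whose count is 1. This is an alternative sketch rather than the method used above; the values_seen_once helper and the in-memory SAMPLE are made up for illustration. As with the two-set version, a value repeated inside a single row also counts as repeated.

```python
import csv
import io
from collections import Counter

# Same two rows as example.csv, kept in memory so the sketch runs on its own.
SAMPLE = (
    "Su baldais,Skalbimo mašina,Viryklė,Indaplovė,Vonia\n"
    "Virtuvės komplektas,Viryklė,Indaplovė,Dušo kabina,Rekuperacinė sistema\n"
)


def values_seen_once(handle):
    """Return the values that occur exactly once across all rows."""
    counts = Counter()
    for row in csv.reader(handle):
        counts.update(row)  # adds 1 to the count of each value in the row
    # Counter keeps insertion order, so the result follows first appearance.
    return [value for value, n in counts.items() if n == 1]


once = values_seen_once(io.StringIO(SAMPLE))
print(", ".join(once))
```

A Counter also makes it easy to change the rule later, for example to keep values that appear at most twice.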

Tags: python, csv