Replacing Duplicates in a column of list type in Pandas

Question

Background: I have a DataFrame df with several columns; the one of interest is the column named 'genres'.

Objective:

The problem can be seen in this image: some entries are duplicates of each other. For example, '[Drama, Romance]' and '[Romance, Drama]' are the same thing.

The objective is to programmatically remove/replace these duplicates, so that each variant is replaced with its equivalent form.

Example:

'[Drama, Romance]' and '[Romance, Drama]'

Here [Romance, Drama] is replaced with [Drama, Romance], or vice versa. Rather than deleting the row entirely, we only replace the contents of the list.

Output - Before Replacing Duplicates '[Drama, Romance]' and '[Romance, Drama]'

Expected Output - After Replacing Duplicates '[Drama, Romance]'

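Filter the df column 'genres' so that it only contains list entries with no more than 3 genres, i.e. remove any row with more than 3 genres. Examples of acceptable results in the 'genres' column: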

[Romance, Drama, Comedy]
[Romance, Drama]
[Drama]

I tried the following:

#to delist the 'genres' column
df['genres'] = df.genres.apply(', '.join)

# code sample of manually replaced duplicated content in genres column
df['genres'] = df['genres'].str.replace("Romance, Drama","Drama, Romance")
df['genres'] = df['genres'].str.replace("Drama, Comedy","Comedy, Drama")

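The code above works, but it handles individual duplicates manually, so I would like to find a way to code this for all duplicates found in the 'genres' column of df.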

Answer 1

Assuming the data type of each row in the column is list:

You can first use sorted to sort the list in each row,

then filter the DataFrame rows with loc and call value_counts() on the result:

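# Sort each list so that equivalent genre combinations become identical
df['genres'] = df['genres'].apply(sorted)
# Keep rows with at most 3 genres; convert to tuples because lists are unhashable
df.loc[df['genres'].apply(len) <= 3, 'genres'].apply(tuple).value_counts()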
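
As a rough end-to-end sketch of the steps above, the following runs on a small made-up frame. The sample rows, the 'title' column, and the names filtered, counts and genres_str are assumptions for illustration only; they are not from the question's data. Because lists are unhashable, value_counts() may raise a TypeError on a column of lists in some pandas versions, so the sketch converts each list to a tuple before counting.

```python
import pandas as pd

# Hypothetical sample data standing in for the question's df.
df = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "genres": [
        ["Drama", "Romance"],
        ["Romance", "Drama"],
        ["Drama", "Comedy"],
        ["Comedy", "Drama"],
        ["Action", "Drama", "Comedy", "Romance"],  # more than 3 genres
    ],
})

# Step 1: sort each list so every ordering of the same genres looks identical.
df["genres"] = df["genres"].apply(sorted)

# Step 2: keep only rows whose list has at most 3 genres.
filtered = df.loc[df["genres"].apply(len) <= 3].copy()

# Step 3: count each combination; tuples are hashable, lists are not.
counts = filtered["genres"].apply(tuple).value_counts()
print(counts)

# Optional: the comma-joined string form used in the question.
filtered["genres_str"] = filtered["genres"].apply(", ".join)
print(filtered)
```

Sorting gives every combination one canonical order, so no pair-by-pair str.replace calls are needed, and joining after the sort yields the same string for '[Romance, Drama]' and '[Drama, Romance]'.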

python

list

eda

pandas