Detecting duplicates in pandas when a column contains lists

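Is there a reasonable way to detect duplicates in a pandas DataFrame when a column contains lists or NumPy ndarrays, as in the example below? I know I could convert the lists to strings, but converting back and forth feels... wrong. Also, given how I got here (online code) and where I'm headed, lists seem clearer and more convenient.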

import pandas as pd

df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "ingredients": [
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredD"],
            ["ingredA", "ingredB", "ingredD", "ingredE"],
            ["ingredB", "ingredC", "ingredF"],
        ],
    }
)

# Traditional find duplicates
# df[df.duplicated(keep=False)]

# Avoiding pandas duplicated function (question 70596016 solution)
i = [hash(tuple(i.values())) for i in df.to_dict(orient="records")]
j = [i.count(k) > 1 for k in i]
df[j]

Both approaches (the latter from question 70596016, as noted in the comment) raise

TypeError: unhashable type: 'list'.

Of course, both would work if the DataFrame looked like this:

df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "recipe": [
            "recipeC",
            "recipeC",
            "recipeD",
            "recipeE",
            "recipeF",
        ],
    }
)

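This makes me wonder whether something like integer encoding would be reasonable. It isn't much different from converting to and from strings, but at least it's legible. Alternatively, suggestions for converting directly from the starting DataFrame in the code link above to a single ingredient string per row (i.e., avoiding lists altogether) would be appreciated.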
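For reference, the two alternatives floated above could look roughly like the sketch below. The column names `recipe_id` and `ingredient_str` and the `|` delimiter are made up for illustration, and the delimiter is assumed never to occur inside an ingredient name.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "ingredients": [
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredD"],
            ["ingredA", "ingredB", "ingredD", "ingredE"],
            ["ingredB", "ingredC", "ingredF"],
        ],
    }
)

# Option 1: integer-encode each distinct ingredient list.
# The dict maps a tuple of ingredients to the next unused integer.
codes = {}
df["recipe_id"] = [
    codes.setdefault(tuple(items), len(codes)) for items in df["ingredients"]
]

# Option 2: one delimited string per row (assumes "|" never appears in a name).
df["ingredient_str"] = df["ingredients"].map("|".join)

# Either derived column is hashable, so duplicated() can use it as a key.
dupes = df[df.duplicated(subset=["author", "date", "recipe_id"], keep=False)]
```

Both options leave the original `ingredients` lists in place, so the encoding only serves as a comparison key and nothing has to be converted back.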

Use map with tuple:

out = df[df.assign(ingredients = df['ingredients'].map(tuple)).duplicated(keep=False)]
Out[295]: 
   author        date                  ingredients
0  Jefe98  1423112400  [ingredA, ingredB, ingredC]
1  Jefe98  1423112400  [ingredA, ingredB, ingredC]
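A self-contained version of the same idea, as a minimal sketch using the question's example frame. The names `key`, `mask`, `dupes` and the order-insensitive variant are additions for illustration, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "ingredients": [
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredD"],
            ["ingredA", "ingredB", "ingredD", "ingredE"],
            ["ingredB", "ingredC", "ingredF"],
        ],
    }
)

# Build a hashable stand-in for the list column without modifying df.
key = df["ingredients"].map(tuple)

# duplicated() runs on a temporary frame; the boolean mask then indexes the
# original, so the "ingredients" column in `dupes` still holds lists.
mask = df.assign(ingredients=key).duplicated(keep=False)
dupes = df[mask]

# Illustrative variant: ignore ingredient order by sorting before hashing.
unordered_key = df["ingredients"].map(lambda items: tuple(sorted(items)))
unordered_mask = df.assign(ingredients=unordered_key).duplicated(keep=False)
```

Because the tuples exist only in the temporary frame, no conversion back to lists is needed. The same pattern should carry over to one-dimensional NumPy arrays, since `tuple(arr)` yields a tuple of scalars. A multidimensional array would need a different key, for example its shape combined with `arr.tobytes()`.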