Detecting duplicates in pandas when a column contains lists

Question

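Is there a reasonable way to detect duplicates in a pandas dataframe when a column contains lists or numpy nd arrays, as in the example below? I know I could convert the lists to strings, but converting back and forth feels... wrong. Also, given how I got here (online code) and where I'm going, lists seem cleaner and more convenient.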

import pandas as pd

df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "ingredients": [
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredD"],
            ["ingredA", "ingredB", "ingredD", "ingredE"],
            ["ingredB", "ingredC", "ingredF"],
        ],
    }
)

# Traditional find duplicates
# df[df.duplicated(keep=False)]

# Avoiding pandas duplicated function (question 70596016 solution)
i = [hash(tuple(i.values())) for i in df.to_dict(orient="records")]
j = [i.count(k) > 1 for k in i]
df[j]

Both methods (the latter taken from the solution referenced in the code comment above) raise

TypeError: unhashable type: 'list'.

Of course, both work if the dataframe looks like this instead:

df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "recipe": [
            "recipeC",
            "recipeC",
            "recipeD",
            "recipeE",
            "recipeF",
        ],
    }
)

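This makes me wonder whether something like integer encoding would be reasonable. It's not much different from converting to and from strings, but at least it's legible. Alternatively, suggestions for converting directly from the starting dataframe in the code link above to a single ingredient string per row (i.e., avoiding lists altogether) would be appreciated.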
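To make the two ideas in that last paragraph concrete, here is a minimal sketch of what integer encoding and a single string per row could look like. The column name `ingredients_id` and the `|` delimiter are assumptions chosen for illustration, not something from the linked code.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "ingredients": [
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredD"],
            ["ingredA", "ingredB", "ingredD", "ingredE"],
            ["ingredB", "ingredC", "ingredF"],
        ],
    }
)

# Tuples are hashable, so they can be used as dictionary keys.
keys = [tuple(items) for items in df["ingredients"]]

# Give each distinct ingredient tuple an integer, in order of first appearance.
codes = {}
for key in keys:
    codes.setdefault(key, len(codes))

# "ingredients_id" is a hypothetical column name used only in this sketch.
encoded = df.assign(ingredients_id=[codes[key] for key in keys])

# Look for duplicates on the hashable columns; the list column stays untouched.
mask = encoded.drop(columns="ingredients").duplicated(keep=False)
duplicates = df[mask]

# Alternative: one delimited string per row (the "|" delimiter is an assumption).
joined = df["ingredients"].str.join("|")
```

The integer codes only carry meaning within this dataframe, so the original lists are still needed to interpret them. The joined strings are self-describing but depend on the delimiter never appearing inside an ingredient name.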

Answer 1

With map and tuple:

out = df[df.assign(ingredients = df['ingredients'].map(tuple)).duplicated(keep=False)]
Out[295]: 
   author        date                  ingredients
0  Jefe98  1423112400  [ingredA, ingredB, ingredC]
1  Jefe98  1423112400  [ingredA, ingredB, ingredC]
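Since the question also mentions numpy arrays, the same idea can probably be extended with a small conversion helper. The following is a sketch only: `to_hashable` is a hypothetical helper, and the `amounts` column and the shortened data are invented for illustration.

```python
import numpy as np
import pandas as pd


def to_hashable(value):
    """Hypothetical helper: turn lists and ndarrays into (nested) tuples."""
    if isinstance(value, np.ndarray):
        value = value.tolist()
    if isinstance(value, (list, tuple)):
        return tuple(to_hashable(item) for item in value)
    return value


df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex"],
        "ingredients": [
            ["ingredA", "ingredB"],
            ["ingredA", "ingredB"],
            ["ingredA", "ingredD"],
        ],
        # "amounts" is an invented column holding 1-D numpy arrays.
        "amounts": [
            np.array([1.0, 2.5]),
            np.array([1.0, 2.5]),
            np.array([1.0, 2.5]),
        ],
    }
)

# Build a temporary copy whose unhashable columns hold tuples instead.
hashable = df.assign(
    ingredients=df["ingredients"].map(to_hashable),
    amounts=df["amounts"].map(to_hashable),
)

# The mask comes from the converted copy, but it indexes the original frame,
# so the result still contains the original lists and arrays.
mask = hashable.duplicated(keep=False)
duplicates = df[mask]
```

Because the conversion happens only on a temporary copy, nothing has to be converted back afterwards.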
