Detecting duplicates in pandas when a column contains lists
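Is there a sensible way to detect duplicates in a pandas DataFrame when a column contains lists or NumPy ndarrays, as in the example below? I know I could convert the lists to strings, but converting back and forth feels... wrong. Besides, given how I got here (online code) and where I'm headed, lists seem clearer and more convenient.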
import pandas as pd

df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "ingredients": [
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredD"],
            ["ingredA", "ingredB", "ingredD", "ingredE"],
            ["ingredB", "ingredC", "ingredF"],
        ],
    }
)

# Traditional find duplicates
# df[df.duplicated(keep=False)]

# Avoiding pandas duplicated function (question 70596016 solution)
i = [hash(tuple(i.values())) for i in df.to_dict(orient="records")]
j = [i.count(k) > 1 for k in i]
df[j]
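Both approaches (the latter taken from the question 70596016 solution) raise
TypeError: unhashable type: 'list'.
Of course, they would work if the DataFrame looked like this instead: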
df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "recipe": [
            "recipeC",
            "recipeC",
            "recipeD",
            "recipeE",
            "recipeF",
        ],
    }
)
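This makes me wonder whether something like integer encoding would be reasonable. It is not much different from converting to and from strings, but at least it is legible. Alternatively, suggestions for converting directly from the starting DataFrame in the code link above to a single ingredients string per row (i.e., avoiding lists altogether) would be appreciated.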
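Lists are unhashable but tuples are, so map the column to tuple before calling duplicated. The conversion happens only on a temporary copy, so the original column keeps its lists:
out = df[df.assign(ingredients=df['ingredients'].map(tuple)).duplicated(keep=False)]
Out[295]:
   author        date                  ingredients
0  Jefe98  1423112400  [ingredA, ingredB, ingredC]
1  Jefe98  1423112400  [ingredA, ingredB, ingredC]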
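Since the question also mentions NumPy ndarrays, here is a minimal sketch that extends the tuple idea to nested lists and arrays. The helper name to_hashable is my own (hypothetical, not part of pandas), and it assumes every cell is a list, a tuple, an ndarray, or an already hashable scalar:

```python
import numpy as np
import pandas as pd


def to_hashable(value):
    """Recursively turn lists and NumPy arrays into nested tuples.

    Hypothetical helper for illustration; not a pandas function.
    """
    if isinstance(value, np.ndarray):
        value = value.tolist()  # nested Python lists of plain scalars
    if isinstance(value, (list, tuple)):
        return tuple(to_hashable(v) for v in value)
    return value  # assumed to be hashable already


df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "ingredients": [
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredD"],
            ["ingredA", "ingredB", "ingredD", "ingredE"],
            ["ingredB", "ingredC", "ingredF"],
        ],
    }
)

# Build the boolean mask from a temporary hashable copy of the column;
# df itself is left untouched and still holds lists.
mask = df.assign(ingredients=df["ingredients"].map(to_hashable)).duplicated(keep=False)
dupes = df[mask]
print(dupes)
```

Note that tuples compare element by element in order, so ["ingredA", "ingredB"] and ["ingredB", "ingredA"] count as different rows. If order should not matter, mapping to frozenset or to tuple(sorted(...)) instead would be one option.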