Pandas duplicated() returns all duplicates once, except one row

Question

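I am trying to get all Nobel Prize winners who won the prize more than once between 1901 and 2016. I tried the pandas duplicated() method, but it returns every duplicate once, except for one row or entry, which shows up twice. I am finding duplicates based on the full_name column of the DataFrame. I tried different parameter combinations but got the same result. I know I can remove that row manually, but what is going wrong here? My code is:

Try-1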

lucky_winners = df[df.duplicated(['full_name'])]

Try-2

lucky_winners = df[df.duplicated(['full_name'], keep='first')]

Try-3

lucky_winners = df[df.duplicated(['full_name'], keep='last')]

Same output:

lucky_winners.full_name

62                           Marie Curie, née Sklodowska
215    Comité international de la Croix Rouge (Intern...
340                                   Linus Carl Pauling
348    Comité international de la Croix Rouge (Intern...
424                                         John Bardeen
505                                     Frederick Sanger
523    Office of the United Nations High Commissioner...

The repeated entity is Comité international de la Croix Rouge (International Committee of the Red Cross). I even compared the two values directly and got True.

Check

lucky_winners.iloc[1].full_name == lucky_winners.iloc[3].full_name

I don't understand what the actual problem is.

Answer 1

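If you are looking for all unique values that occur more than once, one approach is to use np.unique with the optional return_counts=True argument. The resulting tuple (unique, counts) can be used together to find all unique values whose count is greater than 1: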

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: # mash keys to get a series with repeated values
   ...: s = pd.Series(list('abcoiansfaionawiaonwncawowc'))

In [4]: # get unique values and counts
   ...: u, c = np.unique(s, return_counts=True)

In [5]: # find all unique keys with occurrence counts > 1
   ...: u[c > 1]
Out[5]: array(['a', 'c', 'i', 'n', 'o', 'w'], dtype=object)
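The same counting idea can be applied to a DataFrame column to pull out the matching rows. This is only a minimal sketch: the toy frame, its placeholder names ("A", "B", "C") and the years are made up for illustration and are not taken from the Nobel dataset in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "B" occurs three times, "A" twice, "C" once
df = pd.DataFrame({
    "full_name": ["A", "B", "C", "B", "A", "B"],
    "year": [1901, 1917, 1930, 1944, 1950, 1963],
})

# Unique names and how often each one occurs
names, counts = np.unique(df["full_name"].to_numpy(), return_counts=True)

# Names that occur more than once, each listed exactly once
repeat_names = names[counts > 1]

# Every row that belongs to one of those names
repeat_rows = df[df["full_name"].isin(repeat_names)]

print(repeat_names)
print(repeat_rows)
```

If you prefer to stay inside pandas, `df["full_name"].value_counts()` gives the same counts as a Series, which can be filtered with `> 1` in the same way.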

Answer 2

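So, to get each repeated name only once, here is what I did (read the question again first):

Get all rows that are flagged as duplicates, i.e. names that occur more than once: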

lucky_winners = df[df.duplicated(['full_name'])]

Then drop the duplicates from this newly created DataFrame:

lucky_winners = lucky_winners.drop_duplicates(subset=['full_name'])

That's it! This way I get every repeated winner exactly once.
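A likely explanation for the extra row in the question: with the default keep='first', duplicated() flags every occurrence after the first one, so a name that occurs three times is flagged twice, while a name that occurs twice is flagged only once. The sketch below illustrates this on made-up data (the frame and its placeholder names are hypothetical, not the Nobel dataset) and shows that drop_duplicates() then leaves one row per repeated name:

```python
import pandas as pd

# Hypothetical toy data: "B" occurs three times, "A" twice, "C" once
df = pd.DataFrame({"full_name": ["A", "B", "C", "B", "A", "B"]})

# keep='first' (the default) flags all occurrences except the first,
# so "B" is flagged twice and "A" once
flagged = df[df.duplicated(["full_name"])]

# Collapse the flagged rows to a single row per name
one_per_name = flagged.drop_duplicates(subset=["full_name"])

# keep=False flags every occurrence of a repeated name instead
all_occurrences = df[df.duplicated(["full_name"], keep=False)]

print(flagged["full_name"].tolist())
print(one_per_name["full_name"].tolist())
print(len(all_occurrences))
```

Assigning the result of drop_duplicates() back to a variable, rather than using inplace=True on a filtered frame, is a design choice that keeps the original df untouched.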


python

duplicates

dataframe

pandas