python dataframe .duplicated returns multiple occurrences for same value

Question

Given the following DataFrame:

import pandas as pd

df = pd.DataFrame({'month': [2, 2, 1, 1, 2, 10],
                   'year': [2017, 2017, 2020, 2020, 2018, 2019],
                   'sale': [60, 45, 90, 20, 28, 36],
                   'title': ['Ones', 'Twoes', 'Three', 'Four', 'Five', 'Six']})

I am trying to get the duplicates in the month column.

df[df.duplicated(subset=['month'])]

By default, keep='first'.

But this returns two rows for month 2.

   month  year  sale  title
1      2  2017    45  Twoes
3      1  2020    20   Four
4      2  2018    28   Five

I am confused by the output. Am I missing something here?

Answer 1

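The output contains every duplicated row except the first occurrence of each value. Month 2 appears three times (rows 0, 1 and 4), so with keep='first' only row 0 is skipped and rows 1 and 4 are both returned.

If you need only the first row of each duplicated value, invert that mask and chain it with a second mask built with keep=False, which marks every row whose value occurs more than once: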

df1 = df[~df.duplicated(subset=['month']) & df.duplicated(subset=['month'], keep=False)]
print(df1)
   month  year  sale  title
0      2  2017    60   Ones
2      1  2020    90  Three
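
To see why month 2 shows up twice in the question's output, it may help to print the boolean masks for each keep option side by side. This is only a sketch on the question's DataFrame; the helper frame `masks` and its column names `keep_first`, `keep_last` and `keep_false` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'month': [2, 2, 1, 1, 2, 10],
                   'year': [2017, 2017, 2020, 2020, 2018, 2019],
                   'sale': [60, 45, 90, 20, 28, 36],
                   'title': ['Ones', 'Twoes', 'Three', 'Four', 'Five', 'Six']})

# One boolean mask per keep option; True means "flagged as a duplicate".
# The column names below are illustrative, not part of pandas.
masks = pd.DataFrame({
    'month': df['month'],
    'keep_first': df.duplicated(subset=['month'], keep='first'),
    'keep_last': df.duplicated(subset=['month'], keep='last'),
    'keep_false': df.duplicated(subset=['month'], keep=False),
})
print(masks)
```

With keep='first', rows 1 and 4 are both flagged for month 2 because only row 0 counts as the first occurrence. With keep=False, all three month 2 rows are flagged.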

Answer 2

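The output is the duplicated rows in the DataFrame, not the rows that remain after dropping duplicates. If you only want the de-duplicated rows, use: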

df.drop_duplicates(subset=['month'])

This gives you:

   month  year  sale  title
0      2  2017    60   Ones
2      1  2020    90  Three
5     10  2019    36    Six

You can set keep to 'first', 'last', or False as needed. Note that the third option is the boolean False, not the string 'None'.
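
As a rough comparison of the keep options accepted by drop_duplicates, the sketch below runs all three on the question's DataFrame. The variable names `first`, `last` and `unique_only` are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({'month': [2, 2, 1, 1, 2, 10],
                   'year': [2017, 2017, 2020, 2020, 2018, 2019],
                   'sale': [60, 45, 90, 20, 28, 36],
                   'title': ['Ones', 'Twoes', 'Three', 'Four', 'Five', 'Six']})

# keep='first' (the default): keep the first row of each month.
first = df.drop_duplicates(subset=['month'])

# keep='last': keep the last row of each month instead.
last = df.drop_duplicates(subset=['month'], keep='last')

# keep=False: drop every row whose month occurs more than once.
unique_only = df.drop_duplicates(subset=['month'], keep=False)

print(first.index.tolist())
print(last.index.tolist())
print(unique_only.index.tolist())
```

Only month 10 occurs exactly once, so keep=False leaves a single row here.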

python

duplicates

dataframe

pandas