Python Groupby & Counting first in Series, sorting by month

Question

I have a pandas DataFrame (this is an example; the real DataFrame is much larger):

import pandas as pd

data = [['345', 1, '2022_Jan'], ['678', 1, '2022_Jan'], ['123', 1, '2022_Feb'], ['123', 1, '2022_Feb'], ['345', 0, '2022_Mar'], ['678', 1, '2022_Mar'], ['901', 0, '2022_Mar'], ['678', 1, '2022_Mar']]

df = pd.DataFrame(data, columns=['ID', 'Error Count', 'Year_Month'])

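The question I am trying to answer is: how many IDs have errors?

I want an output that is grouped by 'Year_Month' and in which each ID that appears in a month is counted as 1. In other words, I want to count each ID only once per month.

When I group by 'Year_Month' and 'ID': df.groupby(['Year_Month', 'ID']).count()

it gives me the output below (see the current output link), with the total error count for each ID, but I only want to count each ID once. I would also like Year_Month to be sorted chronologically; I don't understand why it isn't, given that my original DataFrame is ordered by month in the Year_Month column.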

My current output

Desired output

Answer 1

Here is one way:

(df
    .groupby(['Year_Month', 'ID'])['Error Count'] # group by the two columns, select the error count
    .sum() # sum the error count within each group
    .apply(lambda x: int(bool(x))) # convert to boolean and back to int
    .to_frame('Error Count') # add name back to applied column
)

                Error Count
Year_Month ID
2022_Feb   123            1
2022_Jan   345            1
           678            1
2022_Mar   345            0
           678            1
           901            0
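If the end goal is a single number per month (the question asks how many IDs have errors), the per-ID flags above can be summed once more by month. This is only a sketch under that reading of the question; the names `flags` and `ids_with_errors` are illustrative and do not come from the original post.

```python
import pandas as pd

data = [['345', 1, '2022_Jan'], ['678', 1, '2022_Jan'],
        ['123', 1, '2022_Feb'], ['123', 1, '2022_Feb'],
        ['345', 0, '2022_Mar'], ['678', 1, '2022_Mar'],
        ['901', 0, '2022_Mar'], ['678', 1, '2022_Mar']]
df = pd.DataFrame(data, columns=['ID', 'Error Count', 'Year_Month'])

# 1 if the ID had at least one error in that month, otherwise 0
flags = (
    df.groupby(['Year_Month', 'ID'])['Error Count']
      .sum()
      .gt(0)
      .astype(int)
)

# Number of distinct IDs with at least one error, per month
# (the months are still in string order here, not chronological order)
ids_with_errors = flags.groupby(level='Year_Month').sum()
print(ids_with_errors)
```

With the sample data this gives 2 for 2022_Jan, 1 for 2022_Feb and 1 for 2022_Mar.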

Answer 2

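Are these actually duplicate records? Are you sure you don't want to record that user 123 had two errors in February?

If so, drop the duplicates first, then group and sum Error Count. The .count() method doesn't do what you think it does: it counts the non-null rows in each group rather than adding up the values: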

df.drop_duplicates(["ID", "Year_Month"]) \
  .groupby(["Year_Month", "ID"])["Error Count"] \
  .sum()

Output:

In [3]: counts = df.drop_duplicates(["ID", "Year_Month"]) \
   ...:            .groupby(["Year_Month", "ID"])["Error Count"] \
   ...:            .sum()

In [4]: counts
Out[4]:
Year_Month  ID
2022_Feb    123    1
2022_Jan    345    1
            678    1
2022_Mar    345    0
            678    1
            901    0
Name: Error Count, dtype: int64
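To make the remark about .count() concrete, the sketch below puts .count() and .sum() side by side on the sample data, without dropping duplicates first. The variable names `grouped` and `comparison` are illustrative only.

```python
import pandas as pd

data = [['345', 1, '2022_Jan'], ['678', 1, '2022_Jan'],
        ['123', 1, '2022_Feb'], ['123', 1, '2022_Feb'],
        ['345', 0, '2022_Mar'], ['678', 1, '2022_Mar'],
        ['901', 0, '2022_Mar'], ['678', 1, '2022_Mar']]
df = pd.DataFrame(data, columns=['ID', 'Error Count', 'Year_Month'])

grouped = df.groupby(['Year_Month', 'ID'])['Error Count']

comparison = pd.DataFrame({
    'count': grouped.count(),  # number of non-null rows in each group
    'sum': grouped.sum(),      # total of the Error Count values in each group
})
print(comparison)
```

The row for ('2022_Mar', '345') shows the difference: there is one record, so count is 1, but its Error Count is 0, so sum is 0.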

As for sorting, you'll want to convert "Year_Month" to datetime objects, because right now the values are just sorted as strings:

In [5]: "2022_Feb" < "2022_Jan"
Out[5]: True

You can do it like this:

In [6]: counts.sort_index(level=0, key=lambda ym: pd.to_datetime(ym, format="%Y_%b"))
Out[6]:
Year_Month  ID
2022_Jan    345    1
            678    1
2022_Feb    123    1
2022_Mar    345    0
            678    1
            901    0
Name: Error Count, dtype: int64
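An alternative sketch, not part of the original answer: convert the labels before grouping, so that groupby itself sorts the months chronologically and no sort key is needed afterwards. It assumes the labels always follow the "%Y_%b" pattern with English month abbreviations; the column name `Month` is a hypothetical helper.

```python
import pandas as pd

data = [['345', 1, '2022_Jan'], ['678', 1, '2022_Jan'],
        ['123', 1, '2022_Feb'], ['123', 1, '2022_Feb'],
        ['345', 0, '2022_Mar'], ['678', 1, '2022_Mar'],
        ['901', 0, '2022_Mar'], ['678', 1, '2022_Mar']]
df = pd.DataFrame(data, columns=['ID', 'Error Count', 'Year_Month'])

# Parse labels such as "2022_Jan" into monthly periods, which sort chronologically
month = pd.to_datetime(df['Year_Month'], format='%Y_%b').dt.to_period('M')

counts = (
    df.assign(Month=month)                  # hypothetical helper column
      .drop_duplicates(['ID', 'Month'])     # keep one record per ID and month
      .groupby(['Month', 'ID'])['Error Count']
      .sum()
)
print(counts)
```

The trade-off is that the index now shows periods such as 2022-01 instead of the original 2022_Jan labels.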

Answer 3

Here is another way.

Use astype(bool) to convert the sum to a boolean, which gives False for 0 and True for any non-zero value, then convert back to an integer with astype(int):


df.groupby(['Year_Month','ID'])['Error Count'].sum().astype(bool).astype(int)

Year_Month  ID 
2022_Feb    123    1
2022_Jan    345    1
            678    1
2022_Mar    345    0
            678    1
            901    0
Name: Error Count, dtype: int32
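The effect of the two casts is easiest to see on a group whose sum is greater than 1. In the sample data, ID 123 has two error rows in February, so its sum is 2 before the casts and 1 afterwards. A minimal sketch; the names `sums` and `flags` are illustrative.

```python
import pandas as pd

data = [['345', 1, '2022_Jan'], ['678', 1, '2022_Jan'],
        ['123', 1, '2022_Feb'], ['123', 1, '2022_Feb'],
        ['345', 0, '2022_Mar'], ['678', 1, '2022_Mar'],
        ['901', 0, '2022_Mar'], ['678', 1, '2022_Mar']]
df = pd.DataFrame(data, columns=['ID', 'Error Count', 'Year_Month'])

sums = df.groupby(['Year_Month', 'ID'])['Error Count'].sum()

# bool() maps 0 to False and any non-zero sum to True; int() maps back to 0 or 1
flags = sums.astype(bool).astype(int)

print(sums.loc[('2022_Feb', '123')], flags.loc[('2022_Feb', '123')])
```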

To sort, assign the result to a variable, then apply ddejohn's solution:

counts = df.groupby(['Year_Month','ID'])['Error Count'].sum().astype(bool).astype(int)

counts.sort_index(level=0, key=lambda ym: pd.to_datetime(ym, format="%Y_%b")) # from ddejohn's answer above


Year_Month  ID 
2022_Jan    345    1
            678    1
2022_Feb    123    1
2022_Mar    345    0
            678    1
            901    0
Name: Error Count, dtype: int32

