Count category duplicates within a time window in a new column in Python (similar to a rolling value_counts)

Question

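I have been trying to solve an exercise for a while without success. I have a dataset containing a list of calls, each with a call topic (in this sample dataset I decided to use ice cream flavors as topics). In a call center, a topic is considered resolved on first contact if it is not mentioned in another call within a 72-hour time window. I need to create a new column in the DataFrame that counts how many times the ice cream flavor in that row was mentioned within a 72-hour window (that is, count how many times an event occurred within a time window).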

I saw a solution that uses get_dummies, but it is very inefficient for me because I have more than 300 ice cream flavors.
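The get_dummies solution itself is not reproduced here, so the following is only an assumed sketch of what such an approach might look like, on a few rows of the sample. The frame name calls and the column name flavor are chosen for illustration. It shows where the cost comes from: the intermediate frame has one column per flavor, so with 300+ flavors every rolling sum runs over 300+ columns even though only one value per row is kept.

```python
import numpy as np
import pandas as pd

# A few rows of the sample data, indexed by call time (names are assumptions).
calls = pd.DataFrame(
    {"flavor": ["Orange", "Banana", "Orange", "Banana", "Orange"]},
    index=pd.to_datetime([
        "2014-01-01 10:00:47",
        "2014-01-01 13:24:58",
        "2014-01-02 10:07:15",
        "2014-01-02 10:57:23",
        "2014-01-03 11:29:02",
    ]),
)

# One indicator column per flavor: n_rows x n_flavors values in memory.
dummies = pd.get_dummies(calls["flavor"]).astype(int)

# Trailing 72-hour sum computed for every flavor column at once.
rolled = dummies.rolling("72h").sum()

# For each row, keep only the value from the column matching its own flavor.
col_pos = rolled.columns.get_indexer(calls["flavor"])
calls["count"] = rolled.to_numpy()[np.arange(len(calls)), col_pos].astype(int)
```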

Here is a sample of my data:

2014-01-01 07:21:51 Apple
2014-01-01 10:00:47 Orange
2014-01-01 13:24:58 Banana
2014-01-01 15:05:22 Strawberry
2014-01-01 23:26:55 Lemon
2014-01-02 10:07:15 Orange
2014-01-02 10:57:23 Banana
2014-01-03 06:32:11 Peach
2014-01-03 11:29:02 Orange
2014-01-03 19:07:37 Coconut
2014-01-03 19:39:53 Mango
2014-01-04 00:02:36 Grape
2014-01-04 06:51:53 Cherry
2014-01-04 07:53:01 Strawberry
2014-01-04 08:57:48 Coconut

This is the expected result:

2014-01-01 07:21:51 Apple   1
2014-01-01 10:00:47 Orange  1
2014-01-01 13:24:58 Banana  1
2014-01-01 15:05:22 Strawberry  1
2014-01-01 23:26:55 Lemon   1
2014-01-02 10:07:15 Orange  2
2014-01-02 10:57:23 Banana  2
2014-01-03 06:32:11 Peach   1
2014-01-03 11:29:02 Orange  3
2014-01-03 19:07:37 Coconut 1
2014-01-03 19:39:53 Mango   1
2014-01-04 00:02:36 Grape   1
2014-01-04 06:51:53 Cherry  1
2014-01-04 07:53:01 Strawberry  2
2014-01-04 08:57:48 Coconut 2
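To pin down what the expected column means, here is a brute-force sketch of the definition. It assumes the two columns are named date and flavor (the sample has no header) and that the window is trailing and includes the current call, i.e. the interval (t - 72h, t]. The function name count_mentions_in_window is made up for illustration. It is quadratic in the number of rows, so it is only useful as a reference to check faster solutions against.

```python
import pandas as pd

def count_mentions_in_window(df, hours=72):
    """Brute-force reference: for each row, count same-flavor rows in (t - window, t]."""
    window = pd.Timedelta(hours=hours)
    counts = []
    for ts, flavor in zip(df["date"], df["flavor"]):
        same_flavor = df["flavor"] == flavor
        in_window = (df["date"] > ts - window) & (df["date"] <= ts)
        counts.append(int((same_flavor & in_window).sum()))
    return counts

# A few rows taken from the sample above.
sample = pd.DataFrame({
    "date": pd.to_datetime([
        "2014-01-01 10:00:47",
        "2014-01-01 13:24:58",
        "2014-01-02 10:07:15",
        "2014-01-02 10:57:23",
        "2014-01-03 11:29:02",
    ]),
    "flavor": ["Orange", "Banana", "Orange", "Banana", "Orange"],
})
sample["count"] = count_mentions_in_window(sample)
```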

I found some similar questions, but none of them fully covers what I need:

Rolling count pandas for categorical variables using time

Answer 1

Add a column named count as a temporary helper that we can sum over.

Setup:

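import pandas as pd

# Assumes data.csv has a "date" column and a "flavor" column.
df = pd.read_csv("data.csv")
df["date"] = pd.to_datetime(df["date"])
df.set_index("date", inplace=True)  # time-based rolling needs a DatetimeIndex
df["count"] = 1  # helper column: one mention per row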

Usage:

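# Per flavor, sum the helper column over a trailing 72-hour window.
# Lowercase "72h" is used because the uppercase "H" alias is deprecated in pandas 2.2+.
result = df.groupby("flavor").rolling("72h").sum().reset_index()
# Attach the rolling counts to the original rows, replacing the helper column.
df = df.merge(result, on=["flavor", "date"], suffixes=("_old", ""))
del df["count_old"]
print(df.to_markdown())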

Output:

|    | flavor     | date                |   count |
|---:|:-----------|:--------------------|--------:|
|  0 | Apple      | 2014-01-01 07:21:51 |       1 |
|  1 | Orange     | 2014-01-01 10:00:47 |       1 |
|  2 | Banana     | 2014-01-01 13:24:58 |       1 |
|  3 | Strawberry | 2014-01-01 15:05:22 |       1 |
|  4 | Lemon      | 2014-01-01 23:26:55 |       1 |
|  5 | Orange     | 2014-01-02 10:07:15 |       2 |
|  6 | Banana     | 2014-01-02 10:57:23 |       2 |
|  7 | Peach      | 2014-01-03 06:32:11 |       1 |
|  8 | Orange     | 2014-01-03 11:29:02 |       3 |
|  9 | Coconut    | 2014-01-03 19:07:37 |       1 |
| 10 | Mango      | 2014-01-03 19:39:53 |       1 |
| 11 | Grape      | 2014-01-04 00:02:36 |       1 |
| 12 | Cherry     | 2014-01-04 06:51:53 |       1 |
| 13 | Strawberry | 2014-01-04 07:53:01 |       2 |
| 14 | Coconut    | 2014-01-04 08:57:48 |       2 |

python

pandas

rolling-computation

pandas-rolling