基于日期的滚动分组总和未返回预期结果

Sum of rolling groupby based on date not returning expected results

我有以下数据框:

            match_date  player_id  player__match_count__won
1  2012-09-25 10:00:00      23640                         1
18 2012-09-25 10:00:00      17969                         0
2  2012-09-26 16:00:00      17268                         1
19 2012-09-26 16:00:00      11247                         0
0  2012-09-25 00:00:00      23640                         1
17 2012-09-25 00:00:00      17268                         0
3  2012-09-29 00:00:00      12202                         1
20 2012-09-29 00:00:00      23640                         0
5  2012-10-02 14:20:00       9241                         1
22 2012-10-02 14:20:00      23640                         0
6  2012-10-03 11:20:00      17268                         1
23 2012-10-03 11:20:00      23129                         0
4  2012-10-02 00:00:00      17268                         1
21 2012-10-02 00:00:00       8515                         0
7  2012-10-05 12:20:00      17268                         1
24 2012-10-05 12:20:00       9241                         0
8  2012-10-06 12:00:00      17268                         1
25 2012-10-06 12:00:00        940                         0
9  2012-10-07 11:00:00       6764                         1
26 2012-10-07 11:00:00      17268                         0
10 2012-10-15 00:00:00      23640                         1
27 2012-10-15 00:00:00      26418                         0
11 2012-10-16 00:00:00      23640                         1
28 2012-10-16 00:00:00       6764                         0
12 2012-10-19 14:00:00      14527                         1
29 2012-10-19 14:00:00      23640                         0
13 2012-10-20 13:20:00        469                         1
30 2012-10-20 13:20:00      17268                         0
14 2012-10-26 00:00:00      23640                         1
31 2012-10-26 00:00:00      11247                         0
15 2012-10-27 00:00:00       7507                         1
32 2012-10-27 00:00:00      23640                         0
16 2012-10-29 17:15:00         72                         1
33 2012-10-29 17:15:00      17268                         0

我想获得 player__match_count__won 滚动 30 天期间的总和(按玩家 ID 分组,不包括滚动 window 中的当前行)。我将下面的内容用于 return DataFrame,因为在真实数据中,我将对多列进行求和:

df.groupby("player_id").rolling("30d", on="match_date", closed="left").sum().sort_values("match_date")

但是,我得到以下信息:

    player_id          match_date  player__match_count__won
0       23640 2012-09-25 00:00:00                       5.0 <- incorrect
17      17268 2012-09-25 00:00:00                       5.0 <- incorrect
1       23640 2012-09-25 10:00:00                       NaN
18      17969 2012-09-25 10:00:00                       NaN
2       17268 2012-09-26 16:00:00                       NaN
19      11247 2012-09-26 16:00:00                       NaN
20      23640 2012-09-29 00:00:00                       2.0
3       12202 2012-09-29 00:00:00                       NaN
4       17268 2012-10-02 00:00:00                       1.0
21       8515 2012-10-02 00:00:00                       NaN
22      23640 2012-10-02 14:20:00                       2.0
5        9241 2012-10-02 14:20:00                       NaN
6       17268 2012-10-03 11:20:00                       1.0
23      23129 2012-10-03 11:20:00                       NaN
24       9241 2012-10-05 12:20:00                       1.0
7       17268 2012-10-05 12:20:00                       3.0
8       17268 2012-10-06 12:00:00                       4.0
25        940 2012-10-06 12:00:00                       NaN
9        6764 2012-10-07 11:00:00                       NaN
26      17268 2012-10-07 11:00:00                       5.0
10      23640 2012-10-15 00:00:00                       2.0
27      26418 2012-10-15 00:00:00                       NaN
28       6764 2012-10-16 00:00:00                       1.0
11      23640 2012-10-16 00:00:00                       3.0
12      14527 2012-10-19 14:00:00                       NaN
29      23640 2012-10-19 14:00:00                       4.0
30      17268 2012-10-20 13:20:00                       5.0
13        469 2012-10-20 13:20:00                       NaN
31      11247 2012-10-26 00:00:00                       0.0
14      23640 2012-10-26 00:00:00                       2.0
32      23640 2012-10-27 00:00:00                       3.0
15       7507 2012-10-27 00:00:00                       NaN
33      17268 2012-10-29 17:15:00                       4.0
16         72 2012-10-29 17:15:00                       NaN

如果我预先在 match_date 上添加一个排序:

df.sort_values("match_date").groupby("player_id").rolling("30d", on="match_date", closed="left").sum().sort_values("match_date")

然后我得到正确的值:

    player_id          match_date  player__match_count__won
17      17268 2012-09-25 00:00:00                       NaN
0       23640 2012-09-25 00:00:00                       NaN
1       23640 2012-09-25 10:00:00                       1.0
18      17969 2012-09-25 10:00:00                       NaN
19      11247 2012-09-26 16:00:00                       NaN
2       17268 2012-09-26 16:00:00                       0.0
20      23640 2012-09-29 00:00:00                       2.0
3       12202 2012-09-29 00:00:00                       NaN
21       8515 2012-10-02 00:00:00                       NaN
4       17268 2012-10-02 00:00:00                       1.0
22      23640 2012-10-02 14:20:00                       2.0
5        9241 2012-10-02 14:20:00                       NaN
23      23129 2012-10-03 11:20:00                       NaN
6       17268 2012-10-03 11:20:00                       2.0
24       9241 2012-10-05 12:20:00                       1.0
7       17268 2012-10-05 12:20:00                       3.0
8       17268 2012-10-06 12:00:00                       4.0
25        940 2012-10-06 12:00:00                       NaN
9        6764 2012-10-07 11:00:00                       NaN
26      17268 2012-10-07 11:00:00                       5.0
10      23640 2012-10-15 00:00:00                       2.0
27      26418 2012-10-15 00:00:00                       NaN
28       6764 2012-10-16 00:00:00                       1.0
11      23640 2012-10-16 00:00:00                       3.0
12      14527 2012-10-19 14:00:00                       NaN
29      23640 2012-10-19 14:00:00                       4.0
30      17268 2012-10-20 13:20:00                       5.0
13        469 2012-10-20 13:20:00                       NaN
31      11247 2012-10-26 00:00:00                       0.0
14      23640 2012-10-26 00:00:00                       2.0
32      23640 2012-10-27 00:00:00                       3.0
15       7507 2012-10-27 00:00:00                       NaN
33      17268 2012-10-29 17:15:00                       4.0
16         72 2012-10-29 17:15:00                       NaN

但是,DataFrame 在原始示例中已按 match_date 排序。

为什么会这样?

为了便于复制,这里是字典形式的 Dataframe:

from pandas import Timestamp

df = {
    'match_date': {1: Timestamp('2012-09-25 10:00:00'), 18: Timestamp('2012-09-25 10:00:00'), 2: Timestamp('2012-09-26 16:00:00'), 19: Timestamp('2012-09-26 16:00:00'), 0: Timestamp('2012-09-25 00:00:00'), 17: Timestamp('2012-09-25 00:00:00'), 3: Timestamp('2012-09-29 00:00:00'), 20: Timestamp('2012-09-29 00:00:00'), 5: Timestamp('2012-10-02 14:20:00'), 22: Timestamp('2012-10-02 14:20:00'), 6: Timestamp('2012-10-03 11:20:00'), 23: Timestamp('2012-10-03 11:20:00'), 4: Timestamp('2012-10-02 00:00:00'), 21: Timestamp('2012-10-02 00:00:00'), 7: Timestamp('2012-10-05 12:20:00'), 24: Timestamp('2012-10-05 12:20:00'), 8: Timestamp('2012-10-06 12:00:00'), 25: Timestamp('2012-10-06 12:00:00'), 9: Timestamp('2012-10-07 11:00:00'), 26: Timestamp('2012-10-07 11:00:00'), 10: Timestamp('2012-10-15 00:00:00'), 27: Timestamp('2012-10-15 00:00:00'), 11: Timestamp('2012-10-16 00:00:00'), 28: Timestamp('2012-10-16 00:00:00'), 12: Timestamp('2012-10-19 14:00:00'), 29: Timestamp('2012-10-19 14:00:00'), 13: Timestamp('2012-10-20 13:20:00'), 30: Timestamp('2012-10-20 13:20:00'), 14: Timestamp('2012-10-26 00:00:00'), 31: Timestamp('2012-10-26 00:00:00'), 15: Timestamp('2012-10-27 00:00:00'), 32: Timestamp('2012-10-27 00:00:00'), 16: Timestamp('2012-10-29 17:15:00'), 33: Timestamp('2012-10-29 17:15:00')}, 
    'player_id': {1: 23640, 18: 17969, 2: 17268, 19: 11247, 0: 23640, 17: 17268, 3: 12202, 20: 23640, 5: 9241, 22: 23640, 6: 17268, 23: 23129, 4: 17268, 21: 8515, 7: 17268, 24: 9241, 8: 17268, 25: 940, 9: 6764, 26: 17268, 10: 23640, 27: 26418, 11: 23640, 28: 6764, 12: 14527, 29: 23640, 13: 469, 30: 17268, 14: 23640, 31: 11247, 15: 7507, 32: 23640, 16: 72, 33: 17268}, 
    'player__match_count__won': {1: 1, 18: 0, 2: 1, 19: 0, 0: 1, 17: 0, 3: 1, 20: 0, 5: 1, 22: 0, 6: 1, 23: 0, 4: 1, 21: 0, 7: 1, 24: 0, 8: 1, 25: 0, 9: 1, 26: 0, 10: 1, 27: 0, 11: 1, 28: 0, 12: 1, 29: 0, 13: 1, 30: 0, 14: 1, 31: 0, 15: 1, 32: 0, 16: 1, 33: 0}, 
}
df = pd.DataFrame(df)

你说

However, the DataFrame was already sorted by match_date in the original example

但是在检查了您在 post 末尾提供的数据后,我发现有一些 2012-09-25 日期即将 2012-09-26 日期,还有一些 2012-10-02 日期 一些 2012-10-03 日期之后。

您的数据最初排序不正确。所以 df.sort_values("match_date") 让它变得不同,这就是为什么你会得到不同的结果。