Python - Calculating a rolling mode on a dataframe

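I have a dataset that reports a value for a specific date, which can then be updated on later dates, creating two date columns for each reported value: Date and Reported_Date. A separate ID field is the index of my dataframe. I'd like to calculate the mode and the max over the last 5 reported dates. I know I can use dataset['Reported_Value'].rolling(5).max() to get the max, but trying to use mode with rolling raises the error 'Rolling' object has no attribute 'mode'. Does anyone know how this can be done? Also, is there a way to have it calculate within a single Date only, so that the first few values for 2021-12-02 don't use the 2021-12-01 values?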

Example dataframe:

    ID    Date          Reported_Date    Reported_Value    Max_Last_5_Reported_Days
     1    2021-12-01    2021-12-10                   5                 NaN
     2    2021-12-01    2021-12-11                   6                 NaN
     3    2021-12-01    2021-12-12                   5                 NaN
     4    2021-12-01    2021-12-13                   3                 NaN
     5    2021-12-01    2021-12-14                   2                 6
     6    2021-12-01    2021-12-15                   11                11
     7    2021-12-01    2021-12-16                   7                 11
     8    2021-12-01    2021-12-17                   5                 11
     9    2021-12-01    2021-12-18                   6                 11
     10   2021-12-01    2021-12-19                   7                 11
     11   2021-12-02    2021-12-10                   2                 7
     12   2021-12-02    2021-12-11                   3                 7
     13   2021-12-02    2021-12-12                   2                 7
     14   2021-12-02    2021-12-13                   4                 7
     15   2021-12-02    2021-12-14                   4                 4
     16   2021-12-02    2021-12-15                   4                 4
     17   2021-12-02    2021-12-16                   3                 4
     18   2021-12-02    2021-12-17                   4                 4
     19   2021-12-02    2021-12-18                   2                 4
     20   2021-12-02    2021-12-19                   4                 4

Desired dataframe:

    ID    Date          Reported_Date    Reported_Value    Max_Last_5_Report_Days   Mode_L5RD
     1    2021-12-01    2021-12-10                   5                 NaN             NaN
     2    2021-12-01    2021-12-11                   6                 NaN             NaN
     3    2021-12-01    2021-12-12                   5                 NaN             NaN
     4    2021-12-01    2021-12-13                   3                 NaN             NaN
     5    2021-12-01    2021-12-14                   2                 6               5
     6    2021-12-01    2021-12-15                   11                11              NaN
     7    2021-12-01    2021-12-16                   6                 11              NaN
     8    2021-12-01    2021-12-17                   5                 11              NaN
     9    2021-12-01    2021-12-18                   6                 11              6
     10   2021-12-01    2021-12-19                   6                 11              6
     11   2021-12-02    2021-12-10                   2                 NaN             NaN
     12   2021-12-02    2021-12-11                   3                 NaN             NaN
     13   2021-12-02    2021-12-12                   2                 NaN             NaN
     14   2021-12-02    2021-12-13                   4                 NaN             NaN
     15   2021-12-02    2021-12-14                   4                 4               4
     16   2021-12-02    2021-12-15                   4                 4               4
     17   2021-12-02    2021-12-16                   3                 4               4
     18   2021-12-02    2021-12-17                   4                 4               4
     19   2021-12-02    2021-12-18                   2                 4               4
     20   2021-12-02    2021-12-19                   4                 4               4

I wasn't sure how to represent windows with more than one mode value, so I've listed them as NaN in the example.

mode is not a predefined function on rolling objects, but you can apply a custom function with rolling(5).apply(custom_function). In your case that could be:

dataset['Reported_Value'].rolling(5).apply(lambda s: s.mode().iloc[0])
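One detail worth spelling out: Series.mode() returns a Series with one entry per tied value, while rolling(...).apply(...) expects a single number per window, so the callback has to reduce the result to a scalar. A minimal sketch, assuming the smallest mode is an acceptable tie-break (the names values and rolling_mode are made up for illustration):

```python
import pandas as pd

# First ten Reported_Value entries from the question's example.
values = pd.Series([5, 6, 5, 3, 2, 11, 7, 5, 6, 7])

# Series.mode() returns every tied value, sorted ascending.
print(pd.Series([2, 3, 2, 4, 4]).mode().tolist())  # [2, 4]

# Reduce each window to one number by taking the first (smallest) mode.
rolling_mode = values.rolling(5).apply(lambda s: s.mode().iloc[0])
print(rolling_mode.tolist())
```

The first four entries are NaN because those windows are not yet full. A window in which every value is distinct also yields its smallest value, since each value then counts as a mode.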

Group by 'Date' and use a rolling max for the maximum over the last 5 reports; apply scipy.stats.mode for the mode:

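from scipy.stats import mode
# Build the 5-row windows inside each Date group so they never span two dates
rolling_obj = df.groupby('Date')['Reported_Value'].rolling(5)
# droplevel(0) removes the Date level so the result aligns with df's index
df['Max_Last_5_Report_Days'] = rolling_obj.max().droplevel(0)
df['Mode_L5RD'] = rolling_obj.apply(lambda x: mode(x)[0]).droplevel(0)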
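For anyone who wants to run this without the original data file, here is a self-contained sketch built from the values in the question. It is not the exact code above: pandas' Series.mode stands in for scipy.stats.mode (an assumption made only to avoid the extra dependency), taking the smallest value on ties.

```python
import pandas as pd

# Reported values from the question: ten reports for each of two dates.
df = pd.DataFrame({
    "Date": ["2021-12-01"] * 10 + ["2021-12-02"] * 10,
    "Reported_Value": [5, 6, 5, 3, 2, 11, 7, 5, 6, 7,
                       2, 3, 2, 4, 4, 4, 3, 4, 2, 4],
})

# Windows are built inside each Date group, so a window never mixes
# values from two different dates.
rolling_obj = df.groupby("Date")["Reported_Value"].rolling(5)

# The result is indexed by (Date, original row label); dropping the Date
# level lets pandas align it back onto df.
df["Max_Last_5_Report_Days"] = rolling_obj.max().droplevel(0)
df["Mode_L5RD"] = rolling_obj.apply(lambda s: s.mode().iloc[0]).droplevel(0)

print(df)
```

Because the grouping happens before the rolling step, the first four rows of 2021-12-02 come out as NaN instead of borrowing values from 2021-12-01.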

Output:

    ID        Date Reported_Date  Reported_Value  Max_Last_5_Reported_Days  \
0    1  2021-12-01    2021-12-10               5                       NaN   
1    2  2021-12-01    2021-12-11               6                       NaN   
2    3  2021-12-01    2021-12-12               5                       NaN   
3    4  2021-12-01    2021-12-13               3                       NaN   
4    5  2021-12-01    2021-12-14               2                       6.0   
5    6  2021-12-01    2021-12-15              11                      11.0   
6    7  2021-12-01    2021-12-16               7                      11.0   
7    8  2021-12-01    2021-12-17               5                      11.0   
8    9  2021-12-01    2021-12-18               6                      11.0   
9   10  2021-12-01    2021-12-19               7                      11.0   
10  11  2021-12-02    2021-12-10               2                       7.0   
11  12  2021-12-02    2021-12-11               3                       7.0   
12  13  2021-12-02    2021-12-12               2                       7.0   
13  14  2021-12-02    2021-12-13               4                       7.0   
14  15  2021-12-02    2021-12-14               4                       4.0   
15  16  2021-12-02    2021-12-15               4                       4.0   
16  17  2021-12-02    2021-12-16               3                       4.0   
17  18  2021-12-02    2021-12-17               4                       4.0   
18  19  2021-12-02    2021-12-18               2                       4.0   
19  20  2021-12-02    2021-12-19               4                       4.0   

    Max_Last_5_Report_Days  Mode_L5RD  
0                      NaN        NaN  
1                      NaN        NaN  
2                      NaN        NaN  
3                      NaN        NaN  
4                      6.0        5.0  
5                     11.0        2.0  
6                     11.0        2.0  
7                     11.0        2.0  
8                     11.0        2.0  
9                     11.0        7.0  
10                     NaN        NaN  
11                     NaN        NaN  
12                     NaN        NaN  
13                     NaN        NaN  
14                     4.0        2.0  
15                     4.0        4.0  
16                     4.0        4.0  
17                     4.0        4.0  
18                     4.0        4.0  
19                     4.0        4.0
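
In the output above, windows with no single most frequent value (rows 5 to 8 and row 14) show the smallest tied value rather than the NaN the question asked for. If NaN is preferred for ties, one possible approach is a small helper that counts values with NumPy. This is a sketch under that assumption; unique_mode is a hypothetical name, not part of pandas or scipy.

```python
import numpy as np
import pandas as pd

def unique_mode(window):
    """Return the window's mode, or NaN when several values tie for it."""
    values, counts = np.unique(window, return_counts=True)
    winners = values[counts == counts.max()]
    return winners[0] if len(winners) == 1 else np.nan

df = pd.DataFrame({
    "Date": ["2021-12-01"] * 10 + ["2021-12-02"] * 10,
    "Reported_Value": [5, 6, 5, 3, 2, 11, 7, 5, 6, 7,
                       2, 3, 2, 4, 4, 4, 3, 4, 2, 4],
})

# raw=True hands each window to unique_mode as a plain NumPy array.
df["Mode_L5RD"] = (
    df.groupby("Date")["Reported_Value"]
      .rolling(5)
      .apply(unique_mode, raw=True)
      .droplevel(0)
)

print(df)
```

With this helper, row 14 (window 2, 3, 2, 4, 4) becomes NaN as well, because 2 and 4 both appear twice.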