Python - Calculating a rolling mode on a dataframe
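I have a dataset that reports a value for a specific date, and that value can then be updated on subsequent dates, so each entry has two date columns, Date and Reported_Date. A separate ID field is the index of my dataframe. I want to calculate the mode and the maximum over the last 5 reported dates. I know I can use dataset['Reported_Value'].rolling(5).max() to get the maximum, but trying the same with mode raises the error 'Rolling' object has no attribute 'mode'. Does anyone know how this can be done? Also, is there a way to restrict the calculation to a single Date, so that the first few values for 2021-12-02 do not use the 2021-12-01 values?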
Example dataframe:
ID Date Reported_Date Reported_Value Max_Last_5_Reported_Days
1 2021-12-01 2021-12-10 5 NaN
2 2021-12-01 2021-12-11 6 NaN
3 2021-12-01 2021-12-12 5 NaN
4 2021-12-01 2021-12-13 3 NaN
5 2021-12-01 2021-12-14 2 6
6 2021-12-01 2021-12-15 11 11
7 2021-12-01 2021-12-16 7 11
8 2021-12-01 2021-12-17 5 11
9 2021-12-01 2021-12-18 6 11
10 2021-12-01 2021-12-19 7 11
11 2021-12-02 2021-12-10 2 7
12 2021-12-02 2021-12-11 3 7
13 2021-12-02 2021-12-12 2 7
14 2021-12-02 2021-12-13 4 7
15 2021-12-02 2021-12-14 4 4
16 2021-12-02 2021-12-15 4 4
17 2021-12-02 2021-12-16 3 4
18 2021-12-02 2021-12-17 4 4
19 2021-12-02 2021-12-18 2 4
20 2021-12-02 2021-12-19 4 4
Desired dataframe:
ID Date Reported_Date Reported_Value Max_Last_5_Report_Days Mode_L5RD
1 2021-12-01 2021-12-10 5 NaN NaN
2 2021-12-01 2021-12-11 6 NaN NaN
3 2021-12-01 2021-12-12 5 NaN NaN
4 2021-12-01 2021-12-13 3 NaN NaN
5 2021-12-01 2021-12-14 2 6 5
6 2021-12-01 2021-12-15 11 11 NaN
7 2021-12-01 2021-12-16 6 11 NaN
8 2021-12-01 2021-12-17 5 11 NaN
9 2021-12-01 2021-12-18 6 11 6
10 2021-12-01 2021-12-19 6 11 6
11 2021-12-02 2021-12-10 2 NaN NaN
12 2021-12-02 2021-12-11 3 NaN NaN
13 2021-12-02 2021-12-12 2 NaN NaN
14 2021-12-02 2021-12-13 4 NaN NaN
15 2021-12-02 2021-12-14 4 4 4
16 2021-12-02 2021-12-15 4 4 4
17 2021-12-02 2021-12-16 3 4 4
18 2021-12-02 2021-12-17 4 4 4
19 2021-12-02 2021-12-18 2 4 4
20 2021-12-02 2021-12-19 4 4 4
I'm not sure how to represent windows that have more than one mode, so I have listed those as NaN in the example.
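mode is not a predefined rolling function, but you can apply a custom function with rolling(5).apply(custom_function). The function must return a single number, and Series.mode() returns a Series, so take its first element. In your case that could be
dataset['Reported_Value'].rolling(5).apply(lambda s: s.mode().iloc[0])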
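If windows with more than one mode should come out as NaN, as in the desired dataframe from the question, the custom function can check how many values Series.mode() returns. The following is a minimal sketch under that assumption; the helper name unique_mode is hypothetical, and the values are the Reported_Value entries of the 2021-12-01 group in the desired dataframe.

```python
import numpy as np
import pandas as pd


def unique_mode(window: pd.Series) -> float:
    """Return the window's mode, or NaN when several values tie for it."""
    modes = window.mode()
    return float(modes.iloc[0]) if len(modes) == 1 else np.nan


# Reported_Value for Date 2021-12-01, copied from the desired dataframe.
values = pd.Series([5, 6, 5, 3, 2, 11, 6, 5, 6, 6])

# The first four windows are incomplete, so they stay NaN.
rolling_mode = values.rolling(5).apply(unique_mode)
print(rolling_mode)
```

Taking .iloc[0] unconditionally would instead return the smallest of the tied values, because Series.mode() returns its results sorted.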
Group by Date and use rolling(5).max() for the maximum over the last 5 days; apply scipy.stats.mode for the mode:
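from scipy.stats import mode

# Roll within each Date so that a window never spans two dates.
rolling_obj = df.groupby('Date')['Reported_Value'].rolling(5)
# The grouped result is indexed by (Date, original index); drop the Date level to align with df.
df['Max_Last_5_Report_Days'] = rolling_obj.max().droplevel(0)
df['Mode_L5RD'] = rolling_obj.apply(lambda x: mode(x)[0]).droplevel(0)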
Output:
ID Date Reported_Date Reported_Value Max_Last_5_Reported_Days \
0 1 2021-12-01 2021-12-10 5 NaN
1 2 2021-12-01 2021-12-11 6 NaN
2 3 2021-12-01 2021-12-12 5 NaN
3 4 2021-12-01 2021-12-13 3 NaN
4 5 2021-12-01 2021-12-14 2 6.0
5 6 2021-12-01 2021-12-15 11 11.0
6 7 2021-12-01 2021-12-16 7 11.0
7 8 2021-12-01 2021-12-17 5 11.0
8 9 2021-12-01 2021-12-18 6 11.0
9 10 2021-12-01 2021-12-19 7 11.0
10 11 2021-12-02 2021-12-10 2 7.0
11 12 2021-12-02 2021-12-11 3 7.0
12 13 2021-12-02 2021-12-12 2 7.0
13 14 2021-12-02 2021-12-13 4 7.0
14 15 2021-12-02 2021-12-14 4 4.0
15 16 2021-12-02 2021-12-15 4 4.0
16 17 2021-12-02 2021-12-16 3 4.0
17 18 2021-12-02 2021-12-17 4 4.0
18 19 2021-12-02 2021-12-18 2 4.0
19 20 2021-12-02 2021-12-19 4 4.0
Max_Last_5_Report_Days Mode_L5RD
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 6.0 5.0
5 11.0 2.0
6 11.0 2.0
7 11.0 2.0
8 11.0 2.0
9 11.0 7.0
10 NaN NaN
11 NaN NaN
12 NaN NaN
13 NaN NaN
14 4.0 2.0
15 4.0 4.0
16 4.0 4.0
17 4.0 4.0
18 4.0 4.0
19 4.0 4.0
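In the output above, ties resolve to the smallest value in the window (for example 2.0 in rows 5 to 8, where all five values are distinct), whereas the question asked for NaN in that case. One possible variant, sketched here with pandas and NumPy only, keeps the per-Date grouping and returns NaN on ties. The helper name mode_or_nan is hypothetical, the frame is rebuilt from the example data in the question, and raw=True is used so the helper receives a plain NumPy array.

```python
import numpy as np
import pandas as pd

# Date and Reported_Value columns from the example dataframe in the question.
df = pd.DataFrame({
    "Date": ["2021-12-01"] * 10 + ["2021-12-02"] * 10,
    "Reported_Value": [5, 6, 5, 3, 2, 11, 7, 5, 6, 7,
                       2, 3, 2, 4, 4, 4, 3, 4, 2, 4],
})


def mode_or_nan(window: np.ndarray) -> float:
    """Return the most frequent value, or NaN if several values share the top count."""
    values, counts = np.unique(window, return_counts=True)
    top = values[counts == counts.max()]
    return float(top[0]) if top.size == 1 else np.nan


# Rolling inside each Date group restarts the window at every new Date.
rolling_obj = df.groupby("Date")["Reported_Value"].rolling(5)
df["Max_Last_5_Report_Days"] = rolling_obj.max().droplevel(0)
df["Mode_L5RD"] = rolling_obj.apply(mode_or_nan, raw=True).droplevel(0)
print(df)
```

With this helper, the first four rows of each Date remain NaN because their windows are incomplete, and any complete window without a single most frequent value is also NaN.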