计算 Pandas 组的 2 个月滚动平均值
Calculate 2-month rolling average for group in Pandas
我正在尝试使用以下 df:
计算玩家 'score' 的两个月滚动平均值
df = pd.DataFrame({'player_id': [1098, 1098, 1098, 1098, 1116, 1116, 1116, 1116, 1116, 1116, 1116, 1116],
'date': ['2018-06-22', '2018-06-23', '2018-07-24', '2018-07-25',
'2018-07-22', '2018-07-23', '2018-07-24', '2018-07-25',
'2018-08-22', '2018-08-23', '2018-08-24', '2018-08-25'],
'score': [-2,1,2,3,-8,3,2,-3,-2,1,2,3]})
我希望将平均值按 'player_ID' 列分组,这样玩家 1098 每行的平均值分别为 -2、-0.5、0.33 和 1。如果 2020 年只有一个日期,那就是那个值。如果日期是 2020 年 1 月和 2 月,则 1 月行就是该值,而 2 月行将是这两个值的平均值。
您可以按 'player_id'
分组并在 'score'
列上使用 expanding().mean()
方法:
df['rolling mean'] = df.groupby('player_id')['score'].expanding().mean().round(2).droplevel(0)
编辑:
鉴于评论中的新信息,也许您需要 rolling.mean
方法。从 OP 向数据框添加了一些行,以更好地显示正在发生的事情。对于数据框 df
:
df = pd.DataFrame({'player_id': [1098, 1098, 1098, 1098, 1098, 1098, 1098, 1098,
1116, 1116, 1116, 1116, 1116, 1116, 1116, 1116],
'date': ['2018-06-22', '2018-06-23', '2018-07-24', '2018-07-25',
'2019-06-22', '2019-06-25', '2019-07-25', '2020-06-22',
'2018-07-22', '2018-07-23', '2018-07-24', '2018-07-25',
'2018-08-22', '2018-08-23', '2018-08-24', '2018-08-25'],
'score': [-2, 1, 2, 3, 7, 8, 6, 5, -8, 3, 2, -3, -2, 1, 2, 3]})
我们在这里找到每个 'player_id'
:
的 60 天滚动平均值
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(by=['player_id','date'])
df['rolling_mean'] = df.set_index('date').groupby('player_id', sort=False)['score'].rolling('60D').mean().round(2).to_numpy()
输出:
player_id date score rolling_mean
0 1098 2018-06-22 -2 -2.00
1 1098 2018-06-23 1 -0.50
2 1098 2018-07-24 2 0.33
3 1098 2018-07-25 3 1.00
6 1098 2019-06-22 7 7.00
7 1098 2019-06-25 8 7.50
5 1098 2019-07-25 6 7.00
4 1098 2020-06-22 5 5.00
8 1116 2018-07-22 -8 -8.00
9 1116 2018-07-23 3 -2.50
10 1116 2018-07-24 2 -1.00
11 1116 2018-07-25 -3 -1.50
12 1116 2018-08-22 -2 -1.60
13 1116 2018-08-23 1 -1.17
14 1116 2018-08-24 2 -0.71
15 1116 2018-08-25 3 -0.25
不确定这是否是最优雅的,但给你:
import pandas as pd
#DataFrame you provided
df = pd.DataFrame({'player_id': [1098, 1098, 1098, 1098, 1116, 1116, 1116, 1116, 1116, 1116, 1116, 1116],
'date': ['2018-06-22', '2018-06-23', '2018-07-24', '2018-07-25',
'2018-07-22', '2018-07-23', '2018-07-24', '2018-07-25',
'2018-08-22', '2018-08-23', '2018-08-24', '2018-08-25'],
'score': [-2,1,2,3,-8,3,2,-3,-2,1,2,3]})
#As best practice, set date strings to date type
df['date'] = pd.to_datetime(df['date'])
"""
Group by player and date, then add a rolling average with minimum level 1
(meaning you can start rolling from the first row). The rolling function creates a series
from your dataframe
"""
rolling_average_series = df.groupby(by=['player_id','date']).rolling(1)['score'].mean()
#Reset series as dataframe with .to_frame then reset index so that player_id and date are still columns
df_grouped = rolling_average_series.to_frame().reset_index(level=['player_id','date'])
对于奇怪的格式表示歉意。
df = pd.DataFrame({'player_id': [1098, 1098, 1098, 1098, 1116, 1116, 1116, 1116, 1116, 1116, 1116, 1116],
'date': ['2018-06-22', '2018-06-23', '2018-07-24', '2018-07-25',
'2018-07-22', '2018-07-23', '2018-07-24', '2018-07-25',
'2018-08-22', '2018-08-23', '2018-08-24', '2018-08-25'],
'score': [-2,1,2,3,-8,3,2,-3,-2,1,2,3]})
window=4
min_periods=3
cap=10
def get_month(date):
return date.month
df=df.sort_values(by='date')
df['date']=pd.to_datetime(df['date'])
df['month']=df['date'].apply(get_month)
df['elapsed_days'] = (df['date']-df['date'].shift(1)).dt.days
df['accumulative_elapsed_days'] = df['elapsed_days'].cumsum()
df['rolling_mean'] = df['score'].rolling(window, min_periods).mean().round(2)
print(df[['date','score','month','elapsed_days','accum_elapsed_days','rolling_mean']])
df.set_index('date',inplace=True)
df[['score','rolling_mean']].plot()
plt.show()
输出:
date score month elapsed_days accum_elapsed_days rolling_mean
0 2018-06-22 -2 6 NaN NaN NaN
1 2018-06-23 1 6 1.0 1.0 NaN
4 2018-07-22 -8 7 29.0 30.0 -3.00
5 2018-07-23 3 7 1.0 31.0 -1.50
2 2018-07-24 2 7 1.0 32.0 -0.50
6 2018-07-24 2 7 0.0 32.0 -0.25
3 2018-07-25 3 7 1.0 33.0 2.50
7 2018-07-25 -3 7 0.0 33.0 1.00
8 2018-08-22 -2 8 28.0 61.0 0.00
9 2018-08-23 1 8 1.0 62.0 -0.25
10 2018-08-24 2 8 1.0 63.0 -0.50
11 2018-08-25 3 8 1.0 64.0 1.00
我正在尝试使用以下 df:
计算玩家 'score' 的两个月滚动平均值df = pd.DataFrame({'player_id': [1098, 1098, 1098, 1098, 1116, 1116, 1116, 1116, 1116, 1116, 1116, 1116],
'date': ['2018-06-22', '2018-06-23', '2018-07-24', '2018-07-25',
'2018-07-22', '2018-07-23', '2018-07-24', '2018-07-25',
'2018-08-22', '2018-08-23', '2018-08-24', '2018-08-25'],
'score': [-2,1,2,3,-8,3,2,-3,-2,1,2,3]})
我希望将平均值按 'player_ID' 列分组,这样玩家 1098 每行的平均值分别为 -2、-0.5、0.33 和 1。如果 2020 年只有一个日期,那就是那个值。如果日期是 2020 年 1 月和 2 月,则 1 月行就是该值,而 2 月行将是这两个值的平均值。
您可以按 'player_id'
分组并在 'score'
列上使用 expanding().mean()
方法:
df['rolling mean'] = df.groupby('player_id')['score'].expanding().mean().round(2).droplevel(0)
编辑:
鉴于评论中的新信息,也许您需要 rolling.mean
方法。从 OP 向数据框添加了一些行,以更好地显示正在发生的事情。对于数据框 df
:
df = pd.DataFrame({'player_id': [1098, 1098, 1098, 1098, 1098, 1098, 1098, 1098,
1116, 1116, 1116, 1116, 1116, 1116, 1116, 1116],
'date': ['2018-06-22', '2018-06-23', '2018-07-24', '2018-07-25',
'2019-06-22', '2019-06-25', '2019-07-25', '2020-06-22',
'2018-07-22', '2018-07-23', '2018-07-24', '2018-07-25',
'2018-08-22', '2018-08-23', '2018-08-24', '2018-08-25'],
'score': [-2, 1, 2, 3, 7, 8, 6, 5, -8, 3, 2, -3, -2, 1, 2, 3]})
我们在这里找到每个 'player_id'
:
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(by=['player_id','date'])
df['rolling_mean'] = df.set_index('date').groupby('player_id', sort=False)['score'].rolling('60D').mean().round(2).to_numpy()
输出:
player_id date score rolling_mean
0 1098 2018-06-22 -2 -2.00
1 1098 2018-06-23 1 -0.50
2 1098 2018-07-24 2 0.33
3 1098 2018-07-25 3 1.00
6 1098 2019-06-22 7 7.00
7 1098 2019-06-25 8 7.50
5 1098 2019-07-25 6 7.00
4 1098 2020-06-22 5 5.00
8 1116 2018-07-22 -8 -8.00
9 1116 2018-07-23 3 -2.50
10 1116 2018-07-24 2 -1.00
11 1116 2018-07-25 -3 -1.50
12 1116 2018-08-22 -2 -1.60
13 1116 2018-08-23 1 -1.17
14 1116 2018-08-24 2 -0.71
15 1116 2018-08-25 3 -0.25
不确定这是否是最优雅的,但给你:
import pandas as pd
#DataFrame you provided
df = pd.DataFrame({'player_id': [1098, 1098, 1098, 1098, 1116, 1116, 1116, 1116, 1116, 1116, 1116, 1116],
'date': ['2018-06-22', '2018-06-23', '2018-07-24', '2018-07-25',
'2018-07-22', '2018-07-23', '2018-07-24', '2018-07-25',
'2018-08-22', '2018-08-23', '2018-08-24', '2018-08-25'],
'score': [-2,1,2,3,-8,3,2,-3,-2,1,2,3]})
#As best practice, set date strings to date type
df['date'] = pd.to_datetime(df['date'])
"""
Group by player and date, then add a rolling average with minimum level 1
(meaning you can start rolling from the first row). The rolling function creates a series
from your dataframe
"""
rolling_average_series = df.groupby(by=['player_id','date']).rolling(1)['score'].mean()
#Reset series as dataframe with .to_frame then reset index so that player_id and date are still columns
df_grouped = rolling_average_series.to_frame().reset_index(level=['player_id','date'])
对于奇怪的格式表示歉意。
df = pd.DataFrame({'player_id': [1098, 1098, 1098, 1098, 1116, 1116, 1116, 1116, 1116, 1116, 1116, 1116],
'date': ['2018-06-22', '2018-06-23', '2018-07-24', '2018-07-25',
'2018-07-22', '2018-07-23', '2018-07-24', '2018-07-25',
'2018-08-22', '2018-08-23', '2018-08-24', '2018-08-25'],
'score': [-2,1,2,3,-8,3,2,-3,-2,1,2,3]})
window=4
min_periods=3
cap=10
def get_month(date):
return date.month
df=df.sort_values(by='date')
df['date']=pd.to_datetime(df['date'])
df['month']=df['date'].apply(get_month)
df['elapsed_days'] = (df['date']-df['date'].shift(1)).dt.days
df['accumulative_elapsed_days'] = df['elapsed_days'].cumsum()
df['rolling_mean'] = df['score'].rolling(window, min_periods).mean().round(2)
print(df[['date','score','month','elapsed_days','accum_elapsed_days','rolling_mean']]) df.set_index('date',inplace=True)
df[['score','rolling_mean']].plot()
plt.show()
输出:
date score month elapsed_days accum_elapsed_days rolling_mean
0 2018-06-22 -2 6 NaN NaN NaN
1 2018-06-23 1 6 1.0 1.0 NaN
4 2018-07-22 -8 7 29.0 30.0 -3.00
5 2018-07-23 3 7 1.0 31.0 -1.50
2 2018-07-24 2 7 1.0 32.0 -0.50
6 2018-07-24 2 7 0.0 32.0 -0.25
3 2018-07-25 3 7 1.0 33.0 2.50
7 2018-07-25 -3 7 0.0 33.0 1.00
8 2018-08-22 -2 8 28.0 61.0 0.00
9 2018-08-23 1 8 1.0 62.0 -0.25
10 2018-08-24 2 8 1.0 63.0 -0.50
11 2018-08-25 3 8 1.0 64.0 1.00