Is there a faster method to do a Pandas groupby cumulative mean?
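I am trying to create a lookup reference table in Python that calculates the cumulative mean of a player's previous (by datetime) game scores, grouped by venue. However, for my specific need, a player should have played at least 2 times at the relevant venue before a 'Venue Preference' cumulative mean is calculated.
df is formatted as follows: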
DateTime | Player | Venue | Score |
---|---|---|---|
2021-09-25 17:15:00 | Tim | Stadium A | 20 |
2021-09-27 10:00:00 | Blake | Stadium B | 30 |
My existing code, which works perfectly but is unfortunately very slow, is as follows:
import numpy as np
import pandas as pd
VenueSum = pd.DataFrame(df.groupby(['DateTime', 'Player', 'Venue'])['Score'].sum().reset_index(name = 'Sum'))
VenueSum['Cumulative Sum'] = VenueSum.sort_values('DateTime').groupby(['Player', 'Venue'])['Sum'].cumsum()
VenueCount = pd.DataFrame(df.groupby(['DateTime', 'Player', 'Venue'])['Score'].count().reset_index(name = 'Count'))
VenueCount['Cumulative Count'] = VenueCount.sort_values('DateTime').groupby(['Player', 'Venue'])['Count'].cumsum()
VenueLookup = VenueSum.merge(VenueCount, how = 'outer', on = ['DateTime', 'Player', 'Venue'])
VenueLookup['Venue Preference'] = np.where(VenueLookup['Cumulative Count'] >= 2, VenueLookup['Cumulative Sum'] / VenueLookup['Cumulative Count'], np.nan)
VenueLookup = VenueLookup.drop(['Sum', 'Cumulative Sum', 'Count', 'Cumulative Count'], axis = 1)
I am sure there is a way to calculate the cumulative mean in one step, without first calculating the cumulative sum and cumulative count, but unfortunately I couldn't get it to work.
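IIUC, you can drop 2 of the groupby calls by aggregating sum and count in a single groupby first, and then taking the cumulative sum of both columns: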
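# Aggregate the per-game sum and count in one pass (groupby sorts by its keys, so rows are in DateTime order)
df1 = df.groupby(['DateTime', 'Player', 'Venue'])['Score'].agg(['sum','count'])
# Running totals of both columns within each Player/Venue pair
df1 = df1.groupby(['Player', 'Venue'])[['sum', 'count']].cumsum().reset_index()
# Only report a mean once the player has at least 2 games at the venue
df1['Venue Preference'] = np.where(df1['count'] >= 2, df1['sum'] / df1['count'], np.nan)
df1 = df1.drop(['sum', 'count'], axis=1)
print(df1)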
DateTime Player Venue Venue Preference
0 2021-09-25 17:15:00 Tim Stadium A NaN
1 2021-09-27 10:00:00 Blake Stadium B NaN
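With only the two sample rows every result is NaN, so the threshold is hard to see. Below is a minimal self-contained sketch of the same approach on a slightly larger frame. The three extra rows and the helper name `venue_preference` are made up for illustration and are not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the first two rows match the question,
# the remaining three are invented so that some groups reach 2+ games.
df = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2021-09-25 17:15:00",
        "2021-09-27 10:00:00",
        "2021-10-02 17:15:00",
        "2021-10-09 17:15:00",
        "2021-10-10 10:00:00",
    ]),
    "Player": ["Tim", "Blake", "Tim", "Tim", "Blake"],
    "Venue": ["Stadium A", "Stadium B", "Stadium A", "Stadium A", "Stadium A"],
    "Score": [20, 30, 40, 60, 10],
})


def venue_preference(frame, min_games=2):
    """Cumulative mean of Score per (Player, Venue), NaN below min_games."""
    # One aggregation per (DateTime, Player, Venue); groupby sorts by its
    # keys, so the result is ordered by DateTime first.
    agg = frame.groupby(["DateTime", "Player", "Venue"])["Score"].agg(["sum", "count"])
    # Running totals of both columns within each (Player, Venue) pair.
    cum = agg.groupby(["Player", "Venue"])[["sum", "count"]].cumsum().reset_index()
    # Keep the mean only where the running game count reaches the threshold.
    cum["Venue Preference"] = (cum["sum"] / cum["count"]).where(cum["count"] >= min_games)
    return cum.drop(columns=["sum", "count"])


lookup = venue_preference(df)
print(lookup)
```

Here Tim's second and third games at Stadium A get 30.0 and 40.0, while every other row stays NaN because that player/venue pair has fewer than 2 games so far. Note that, as in the code above, the running mean includes the current game's score.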