How do I use the Pandas groupby function to calculate the mean for the previous year?
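I'm trying to find a way to compute a player's average score for the "Last Season" (the previous year) and add it to the original dataframe df as a new column.
I have already written a formula that gets a player's average score for the current year, excluding the current row, as follows: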
# transform keeps the original row index, so the result aligns with df
df['Season Avg'] = (
    df.groupby([df['Player'], df['DateTime'].dt.year])['Score']
      .transform(lambda x: x.shift(1).expanding().mean())
)
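However, despite my best efforts with the shift function, I can't quite work out how to compute the previous year's average ("Last Season Avg") directly into a new column.
The dataframe is set up as follows: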
Player | DateTime | Score | Season Avg |
---|---|---|---|
PlayerB | 2020-MM-DD HH:MM:SS | 40 | NaN |
PlayerA | 2020-MM-DD HH:MM:SS | 50 | NaN |
PlayerA | 2021-MM-DD HH:MM:SS | 100 | NaN |
PlayerB | 2021-MM-DD HH:MM:SS | 200 | NaN |
PlayerA | 2021-MM-DD HH:MM:SS | 160 | 100 |
PlayerB | 2021-MM-DD HH:MM:SS | 140 | 200 |
PlayerB | 2021-MM-DD HH:MM:SS | 160 | 170 |
PlayerA | 2021-MM-DD HH:MM:SS | 200 | 130 |
The new dataframe I would ideally like to end up with:
Player | DateTime | Score | Season Avg | Last Season Avg |
---|---|---|---|---|
PlayerB | 2020-MM-DD HH:MM:SS | 40 | NaN | NaN |
PlayerA | 2020-MM-DD HH:MM:SS | 50 | NaN | NaN |
PlayerA | 2021-MM-DD HH:MM:SS | 100 | NaN | 50 |
PlayerB | 2021-MM-DD HH:MM:SS | 200 | NaN | 40 |
PlayerA | 2021-MM-DD HH:MM:SS | 160 | 100 | 50 |
PlayerB | 2021-MM-DD HH:MM:SS | 140 | 200 | 40 |
PlayerB | 2021-MM-DD HH:MM:SS | 160 | 170 | 40 |
PlayerA | 2021-MM-DD HH:MM:SS | 200 | 130 | 50 |
You can groupby once on "Player" and the year to find each player's yearly average, then groupby "Player" and shift to get the previous year's average.
out = df.groupby(['Player', df['DateTime'].dt.year])['Score'].mean().reset_index(name='Season Avg')
# group the frame (not the bare column) so that 'Player' is available as a key
out['Last Season Avg'] = out.groupby('Player')['Season Avg'].shift()
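As a minimal sketch of these two steps, the following runs them on a small frame modeled on the question's sample data. The concrete dates are placeholders (the question only shows the year), and the `Year` column name is chosen here for readability.

```python
import pandas as pd

# Sample data modeled on the question; the exact dates are assumed placeholders.
df = pd.DataFrame({
    "Player": ["PlayerB", "PlayerA", "PlayerA", "PlayerB",
               "PlayerA", "PlayerB", "PlayerB", "PlayerA"],
    "DateTime": pd.to_datetime([
        "2020-03-01", "2020-03-02", "2021-03-01", "2021-03-02",
        "2021-03-03", "2021-03-04", "2021-03-05", "2021-03-06"]),
    "Score": [40, 50, 100, 200, 160, 140, 160, 200],
})

# Step 1: one row per (player, year) holding that season's mean score.
year = df["DateTime"].dt.year.rename("Year")
out = df.groupby(["Player", year])["Score"].mean().reset_index(name="Season Avg")

# Step 2: within each player, shift the season means down by one season.
out["Last Season Avg"] = out.groupby("Player")["Season Avg"].shift()
print(out)
```

Note that `out` has one row per player and season, not one row per game, so the result still has to be mapped or merged back onto `df` to get the column the question asks for.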
If you are looking for the career average up to a given season, you can use expanding().mean():
# assumes `out` from above; expanding mean of the season averages within each player
out['Career Avg by Season'] = out.groupby('Player')['Season Avg'].expanding().mean().droplevel(0)
out['Career Avg by Last Season'] = out.groupby('Player')['Career Avg by Season'].shift()
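One reading of "career average before a given season" weights every earlier game equally. A minimal sketch under that assumption follows; the column names `Total`, `Games`, and `Career Avg Before Season` are made up for illustration, and a 2022 game is added to the sample so that the career figure differs from the plain previous-season mean.

```python
import pandas as pd

# Assumed sample: the question's scores plus one extra 2022 game for PlayerA.
df = pd.DataFrame({
    "Player": ["PlayerA", "PlayerB", "PlayerA", "PlayerB", "PlayerA",
               "PlayerB", "PlayerB", "PlayerA", "PlayerA"],
    "DateTime": pd.to_datetime([
        "2020-03-01", "2020-03-02", "2021-03-01", "2021-03-02", "2021-03-03",
        "2021-03-04", "2021-03-05", "2021-03-06", "2022-03-01"]),
    "Score": [50, 40, 100, 200, 160, 140, 160, 200, 90],
})

year = df["DateTime"].dt.year.rename("Year")

# Per (player, season): total points and number of games played.
season = df.groupby(["Player", year])["Score"].agg(Total="sum", Games="count")

# Running totals per player, shifted by one season so that each season
# only sees the seasons that came before it.
cum = season.groupby(level="Player").cumsum()
prior = cum.groupby(level="Player").shift()
season["Career Avg Before Season"] = prior["Total"] / prior["Games"]
print(season.reset_index())
```

For PlayerA in 2022 this gives (50 + 100 + 160 + 200) / 4 = 127.5, whereas averaging the two season means would give a different number because 2020 has only one game.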
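Edit:
Now that sample data has been provided, let's test it. The main problem here is that "year" contains duplicate values. @PaulRougieux handled that very elegantly. Here is another option. The idea is to find the previous season's average and map it back onto the original df (rather than using a transform).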
# DateTime is still a string in the sample, so str[:4] extracts the year
df['Last Season Avg'] = (
    df.set_index(['Player', df['DateTime'].str[:4]]).index
      .map(df.groupby(['Player', df['DateTime'].str[:4]])['Score'].mean()
             .groupby(level=0).shift())
)
Output:
Player DateTime Score Season Avg Last Season Avg
0 PlayerB 2020-MM-DD HH:MM:SS 40 NaN NaN
1 PlayerA 2020-MM-DD HH:MM:SS 50 NaN NaN
2 PlayerA 2021-MM-DD HH:MM:SS 100 NaN 50.0
3 PlayerB 2021-MM-DD HH:MM:SS 200 NaN 40.0
4 PlayerA 2021-MM-DD HH:MM:SS 160 100.0 50.0
5 PlayerB 2021-MM-DD HH:MM:SS 140 200.0 40.0
6 PlayerB 2021-MM-DD HH:MM:SS 160 170.0 40.0
7 PlayerA 2021-MM-DD HH:MM:SS 200 130.0 50.0
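The snippet above slices the year out of a string column. If `DateTime` has already been parsed with `pd.to_datetime`, a similar sketch could key on `.dt.year` instead. The dates below are placeholders, and `prev_avg` and `keys` are names introduced here for illustration.

```python
import pandas as pd

# Sample data modeled on the question; the exact dates are assumed placeholders.
df = pd.DataFrame({
    "Player": ["PlayerB", "PlayerA", "PlayerA", "PlayerB",
               "PlayerA", "PlayerB", "PlayerB", "PlayerA"],
    "DateTime": pd.to_datetime([
        "2020-03-01", "2020-03-02", "2021-03-01", "2021-03-02",
        "2021-03-03", "2021-03-04", "2021-03-05", "2021-03-06"]),
    "Score": [40, 50, 100, 200, 160, 140, 160, 200],
})

year = df["DateTime"].dt.year

# Season means indexed by (Player, year), shifted one season within each player.
prev_avg = (df.groupby(["Player", year])["Score"].mean()
              .groupby(level=0).shift())

# Look up each row's (Player, year) pair in the shifted season means.
keys = pd.MultiIndex.from_arrays([df["Player"], year])
df["Last Season Avg"] = keys.map(prev_avg)
print(df)
```

Because the lookup is done per row, duplicated years in `df` are not a problem: every game in the same season simply receives the same value.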
Create a sample dataset
import pandas
df = pandas.DataFrame(
    {'player': ['B', 'A', 'A', 'B', 'A', 'B', 'B', 'A'],
     'datetime': ['2020-01-01', '2020-01-01', '2021-01-01', '2021-01-01',
                  '2021-01-01', '2021-01-01', '2021-01-01', '2021-01-01'],
     'score': [40, 50, 100, 200, 160, 140, 160, 200],
    }
)
df["datetime"] = pandas.to_datetime(df["datetime"])
df["year"] = df["datetime"].dt.year
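Use transform to add the current season average to the dataframe (grouping on year, so it also works when the dates within a season differ)
df["season_avg"] = df.groupby(["year", "player"])["score"].transform("mean")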
df
player datetime score year season_avg
0 B 2020-01-01 40 2020 40.000000
1 A 2020-01-01 50 2020 50.000000
2 A 2021-01-01 100 2021 153.333333
3 B 2021-01-01 200 2021 166.666667
4 A 2021-01-01 160 2021 153.333333
5 B 2021-01-01 140 2021 166.666667
6 B 2021-01-01 160 2021 166.666667
7 A 2021-01-01 200 2021 153.333333
Shift cannot be applied directly here, because the years are duplicated
df.sort_values(["year"], ascending=True).groupby(["player"])["season_avg"].transform("shift")
0 NaN
1 NaN
2 50.000000
3 40.000000
4 153.333333
5 166.666667
6 166.666667
7 153.333333
Name: season_avg, dtype: float64
Compute the previous year's averages and join them to the original dataframe
savg = (df.groupby(["year", "player"])
.agg(last_season_avg = ("score", "mean"))
.reset_index())
savg["year"] = savg["year"] + 1
savg
year player last_season_avg
0 2021 A 50.000000
1 2021 B 40.000000
2 2022 A 153.333333
3 2022 B 166.666667
df.merge(savg, on=["player", "year"], how="left" )
player datetime score year season_avg last_season_avg
0 B 2020-01-01 40 2020 40.000000 NaN
1 A 2020-01-01 50 2020 50.000000 NaN
2 A 2021-01-01 100 2021 153.333333 50.0
3 B 2021-01-01 200 2021 166.666667 40.0
4 A 2021-01-01 160 2021 153.333333 50.0
5 B 2021-01-01 140 2021 166.666667 40.0
6 B 2021-01-01 160 2021 166.666667 40.0
7 A 2021-01-01 200 2021 153.333333 50.0
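The merge steps above can be wrapped in a small helper. This is only a sketch: the function name `add_last_season_avg` is hypothetical, and it assumes the `player`, `year`, and `score` columns from the sample dataset. The `validate="many_to_one"` argument makes pandas raise if the lookup table ever contained duplicate keys, which would otherwise silently duplicate rows.

```python
import pandas as pd

def add_last_season_avg(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a last_season_avg column (hypothetical helper)."""
    savg = (df.groupby(["year", "player"], as_index=False)
              .agg(last_season_avg=("score", "mean")))
    # Relabel each season's mean with the following year so that it lines up
    # with the rows it should be attached to.
    savg["year"] = savg["year"] + 1
    # many_to_one: every (player, year) pair appears at most once in savg.
    return df.merge(savg, on=["player", "year"], how="left",
                    validate="many_to_one")

df = pd.DataFrame({
    "player": ["B", "A", "A", "B", "A", "B", "B", "A"],
    "year": [2020, 2020, 2021, 2021, 2021, 2021, 2021, 2021],
    "score": [40, 50, 100, 200, 160, 140, 160, 200],
})
result = add_last_season_avg(df)
print(result)
```

Since the join key is the calendar year, a player who skipped a season gets NaN for the following one rather than the mean of some older season.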
Another way to compute the previous year's average, using shift, which is perhaps more elegant than year + 1.
savg = (df.groupby(["year", "player"])
.agg(season_avg = ("score", "mean"))
.reset_index()
.sort_values(["year"])
)
savg["last_season_avg"] = savg.groupby(["player"])["season_avg"].transform("shift")
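The snippet above stops at the per-season table. To get the value onto every game row, one option is to merge `savg` back on `player` and `year`. The sketch below does that on a small made-up dataset in which player A has no 2020 season, to show a difference from the year + 1 variant: shift picks up the previous season the player actually appears in, which is the previous calendar year only when no seasons are missing.

```python
import pandas as pd

# Made-up data: player A plays in 2019 and 2021 but skips 2020.
df = pd.DataFrame({
    "player": ["A", "B", "A", "A", "B"],
    "year": [2019, 2020, 2021, 2021, 2021],
    "score": [50, 40, 100, 160, 200],
})

# One row per (year, player) with the season mean, ordered by year.
savg = (df.groupby(["year", "player"])
          .agg(season_avg=("score", "mean"))
          .reset_index()
          .sort_values(["year"]))

# Previous available season per player.
savg["last_season_avg"] = savg.groupby("player")["season_avg"].shift()

# Attach the per-season value to every game row.
result = df.merge(savg[["player", "year", "last_season_avg"]],
                  on=["player", "year"], how="left")
print(result)
```

Here player A's 2021 rows receive 50.0, the 2019 mean. If a skipped season should yield NaN instead, the year + 1 merge is the safer choice.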