How do I use the Pandas groupby function to calculate the mean for the previous year?

I'm trying to find a way to compute each player's average score for the previous year ("Last Season") and add it as a new column to the original dataframe df.

I've already written a formula that gets a player's average score for the current year, excluding the current row, as follows:

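# Expanding mean of each player's earlier scores in the same year (current row excluded).
# transform keeps the original row index, so the result aligns with df on assignment.
df['Season Avg'] = (df.groupby([df['Player'], df['DateTime'].dt.year])['Score']
                      .transform(lambda x: x.shift(1).expanding().mean()))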

However, despite my best efforts with the shift function, I can't quite work out how to compute the previous year's average ("Last Season Avg") directly into a new column.

The dataframe is set up as follows:

Player   DateTime             Score  Season Avg
PlayerB  2020-MM-DD HH:MM:SS     40         NaN
PlayerA  2020-MM-DD HH:MM:SS     50         NaN
PlayerA  2021-MM-DD HH:MM:SS    100         NaN
PlayerB  2021-MM-DD HH:MM:SS    200         NaN
PlayerA  2021-MM-DD HH:MM:SS    160         100
PlayerB  2021-MM-DD HH:MM:SS    140         200
PlayerB  2021-MM-DD HH:MM:SS    160         170
PlayerA  2021-MM-DD HH:MM:SS    200         130

The new dataframe I would ideally like:

Player   DateTime             Score  Season Avg  Last Season Avg
PlayerB  2020-MM-DD HH:MM:SS     40         NaN              NaN
PlayerA  2020-MM-DD HH:MM:SS     50         NaN              NaN
PlayerA  2021-MM-DD HH:MM:SS    100         NaN               50
PlayerB  2021-MM-DD HH:MM:SS    200         NaN               40
PlayerA  2021-MM-DD HH:MM:SS    160         100               50
PlayerB  2021-MM-DD HH:MM:SS    140         200               40
PlayerB  2021-MM-DD HH:MM:SS    160         170               40
PlayerA  2021-MM-DD HH:MM:SS    200         130               50

You can group by "Player" and year once to find each player's yearly average, then group by "Player" and shift to get the previous year's average.

out = df.groupby(['Player', df['DateTime'].dt.year])['Score'].mean().reset_index(name='Season Avg')
out['Last Season Avg'] = out.groupby('Player')['Season Avg'].shift()

If you're looking for the career average up to a given season, you can use expanding().mean():

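out = df.groupby(['Player', df['DateTime'].dt.year])['Score'].mean().reset_index(name='Season Avg')
# Expanding mean of the season averages within each player (each season weighted equally)
out['Career Avg by Season'] = out.groupby('Player')['Season Avg'].transform(lambda s: s.expanding().mean())
out['Career Avg by Last Season'] = out.groupby('Player')['Career Avg by Season'].shift()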
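
If the career average should instead weight every individual score equally, rather than treating each season as one data point, one possible sketch is to accumulate per-season sums and counts. The sample values below mirror the question's table, but the exact dates and the "Year" level name are assumptions made only for illustration.

```python
import pandas as pd

# Hypothetical sample data following the question's layout (dates are made up).
df = pd.DataFrame({
    "Player": ["PlayerB", "PlayerA", "PlayerA", "PlayerB",
               "PlayerA", "PlayerB", "PlayerB", "PlayerA"],
    "DateTime": pd.to_datetime(["2020-01-01", "2020-01-02", "2021-01-01", "2021-01-02",
                                "2021-01-03", "2021-01-04", "2021-01-05", "2021-01-06"]),
    "Score": [40, 50, 100, 200, 160, 140, 160, 200],
})

# Per-season total score and number of games for each player.
season = (df.groupby(["Player", df["DateTime"].dt.year.rename("Year")])["Score"]
            .agg(["sum", "count"]))

# Running totals across seasons, restarted for each player.
running = season.groupby(level="Player").cumsum()

# Career average up to and including each season.
career = (running["sum"] / running["count"]).rename("Career Avg by Season").reset_index()

# Career average up to the end of the previous season.
career["Career Avg by Last Season"] = career.groupby("Player")["Career Avg by Season"].shift()
print(career)
```

The result has one row per player and season, so it still needs to be mapped or merged back onto the original rows if a per-row column is wanted.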

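Edit:

Now that sample data has been provided, let's test it. The main issue here is that "year" has repeated values. @PaulRougieux handles this very elegantly. Here is another option. The idea is to find last season's average and map it back to the original df (rather than transforming it).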

df['Last Season Avg'] = (df.set_index(['Player', df['DateTime'].str[:4]]).index
                             .map(df.groupby(['Player', df['DateTime'].str[:4]])['Score'].mean()
                                  .groupby(level=0).shift()))

Output:

    Player             DateTime  Score  Season Avg  Last Season Avg
0  PlayerB  2020-MM-DD HH:MM:SS     40         NaN              NaN
1  PlayerA  2020-MM-DD HH:MM:SS     50         NaN              NaN
2  PlayerA  2021-MM-DD HH:MM:SS    100         NaN             50.0
3  PlayerB  2021-MM-DD HH:MM:SS    200         NaN             40.0
4  PlayerA  2021-MM-DD HH:MM:SS    160       100.0             50.0
5  PlayerB  2021-MM-DD HH:MM:SS    140       200.0             40.0
6  PlayerB  2021-MM-DD HH:MM:SS    160       170.0             40.0
7  PlayerA  2021-MM-DD HH:MM:SS    200       130.0             50.0
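
The same lookup can be spelled out step by step when DateTime is a real datetime column rather than a string. The following is only a sketch: it uses reindex instead of Index.map for the lookup, and the dates and the "Year" level name are assumptions, not part of the original data.

```python
import pandas as pd

# Hypothetical sample data following the question's layout (dates are made up).
df = pd.DataFrame({
    "Player": ["PlayerB", "PlayerA", "PlayerA", "PlayerB",
               "PlayerA", "PlayerB", "PlayerB", "PlayerA"],
    "DateTime": pd.to_datetime(["2020-01-01", "2020-01-02", "2021-01-01", "2021-01-02",
                                "2021-01-03", "2021-01-04", "2021-01-05", "2021-01-06"]),
    "Score": [40, 50, 100, 200, 160, 140, 160, 200],
})
year = df["DateTime"].dt.year.rename("Year")

# One value per (Player, Year): that season's mean score.
season_avg = df.groupby(["Player", year])["Score"].mean()

# Shift within each player so each season holds the previous season's mean.
last_season = season_avg.groupby(level="Player").shift()

# Build a (Player, Year) key for every original row and look it up.
# The lookup table has unique keys, so repeated keys in the target are fine.
keys = pd.MultiIndex.from_arrays([df["Player"], year])
df["Last Season Avg"] = last_season.reindex(keys).to_numpy()
print(df)
```

Because the shift happens on the de-duplicated (Player, Year) table, repeated years in the original rows no longer interfere.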

Create a sample dataset

import pandas
import numpy as np
df = pandas.DataFrame(
    {'player': ['B', 'A', 'A', 'B', 'A', 'B', 'B', 'A'],
     'datetime': ['2020-01-01', '2020-01-01', '2021-01-01', '2021-01-01',
                  '2021-01-01', '2021-01-01', '2021-01-01', '2021-01-01'],
     'score': [40, 50, 100, 200, 160, 140, 160, 200],
    }
)
df["datetime"] = pandas.to_datetime(df["datetime"])
df["year"] = df["datetime"].dt.year

Use transform to add the current season average to the dataframe

df["season_avg"] = df.groupby(["year", "player"])["score"].transform("mean")
df

  player   datetime  score  year  season_avg
0      B 2020-01-01     40  2020   40.000000
1      A 2020-01-01     50  2020   50.000000
2      A 2021-01-01    100  2021  153.333333
3      B 2021-01-01    200  2021  166.666667
4      A 2021-01-01    160  2021  153.333333
5      B 2021-01-01    140  2021  166.666667
6      B 2021-01-01    160  2021  166.666667
7      A 2021-01-01    200  2021  153.333333

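Shift cannot be applied here because the years are repeated: it moves values by one row within each player, not by one season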

df.sort_values(["year"], ascending=True).groupby(["player"])["season_avg"].transform("shift")

0           NaN
1           NaN
2     50.000000
3     40.000000
4    153.333333
5    166.666667
6    166.666667
7    153.333333
Name: season_avg, dtype: float64

Compute the previous year's averages and join them onto the original dataframe

savg = (df.groupby(["year", "player"])
        .agg(last_season_avg = ("score", "mean"))
        .reset_index())
savg["year"] = savg["year"] + 1
savg

   year player  last_season_avg
0  2021      A        50.000000
1  2021      B        40.000000
2  2022      A       153.333333
3  2022      B       166.666667

df.merge(savg, on=["player", "year"], how="left" )

  player   datetime  score  year  season_avg  last_season_avg
0      B 2020-01-01     40  2020   40.000000              NaN
1      A 2020-01-01     50  2020   50.000000              NaN
2      A 2021-01-01    100  2021  153.333333             50.0
3      B 2021-01-01    200  2021  166.666667             40.0
4      A 2021-01-01    160  2021  153.333333             50.0
5      B 2021-01-01    140  2021  166.666667             40.0
6      B 2021-01-01    160  2021  166.666667             40.0
7      A 2021-01-01    200  2021  153.333333             50.0
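
The merge above relies on savg having exactly one row per (player, year). As a small sketch, that assumption can be checked by pandas itself through the validate argument of merge. The dataframe here is a reduced copy of the sample data with the year column written out directly.

```python
import pandas as pd

# Reduced copy of the sample data (year given directly instead of a datetime).
df = pd.DataFrame({
    "player": ["B", "A", "A", "B", "A", "B", "B", "A"],
    "year": [2020, 2020, 2021, 2021, 2021, 2021, 2021, 2021],
    "score": [40, 50, 100, 200, 160, 140, 160, 200],
})

savg = (df.groupby(["year", "player"])
          .agg(last_season_avg=("score", "mean"))
          .reset_index())
savg["year"] += 1  # a 2020 average becomes the "last season" value for 2021

# validate="many_to_one" raises MergeError if savg has duplicate (player, year) keys,
# which would otherwise silently duplicate rows of df.
merged = df.merge(savg, on=["player", "year"], how="left", validate="many_to_one")
print(merged)
```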

Another way to compute the previous year's average, using shift, which may be more elegant than year + 1.

savg = (df.groupby(["year", "player"])
        .agg(season_avg = ("score", "mean"))
        .reset_index()
        .sort_values(["year"])
       )
savg["last_season_avg"] = savg.groupby(["player"])["season_avg"].transform("shift")
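
The two variants are not quite interchangeable when a player misses a season: shift returns the last season the player actually played, while year + 1 only matches the immediately preceding calendar year. A minimal sketch of the difference, using made-up data in which player A has no 2020 rows (the column names last_played_avg and prev_year_avg are hypothetical), and merging the result back onto the original rows:

```python
import pandas as pd

# Hypothetical data: player A skipped the 2020 season.
df = pd.DataFrame({
    "player": ["A", "A", "B", "B", "B"],
    "year":   [2019, 2021, 2019, 2020, 2021],
    "score":  [10, 30, 20, 40, 60],
})

savg = (df.groupby(["year", "player"])
          .agg(season_avg=("score", "mean"))
          .reset_index()
          .sort_values(["year"]))

# shift: average of the previous season the player actually played.
savg["last_played_avg"] = savg.groupby("player")["season_avg"].shift()

# year + 1: average of strictly the previous calendar year.
prev = (savg[["player", "year", "season_avg"]]
        .rename(columns={"season_avg": "prev_year_avg"})
        .assign(year=lambda d: d["year"] + 1))
savg = savg.merge(prev, on=["player", "year"], how="left")

# Attach both columns to the original rows.
result = df.merge(savg.drop(columns="season_avg"), on=["player", "year"], how="left")
print(result)
```

For player A in 2021, last_played_avg holds the 2019 average while prev_year_avg is NaN; which behaviour is wanted depends on how "last season" is defined.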