How to make a new pandas column that's the average of the last 3 values?

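Say I have a dataframe with 3 columns: dt, unit, sold. What I'd like to know is how to create a new column called, say, prior_3_avg, which is, as the name suggests, the average of sold for the same unit over the past three occurrences of the same day of the week as dt. For example, for unit "1" on May 5th, 2020, what was the average it sold on April 28th, 21st, and 14th (the last three Tuesdays)?

Toy sample data: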

df = pd.DataFrame({'dt':['2020-5-1','2020-5-2','2020-5-3','2020-5-4','2020-5-5','2020-5-6','2020-5-7','2020-5-8','2020-5-9','2020-5-10','2020-5-11','2020-5-12','2020-5-13','2020-5-14','2020-5-15','2020-5-16','2020-5-17','2020-5-18','2020-5-19','2020-5-20','2020-5-21','2020-5-22','2020-5-23','2020-5-24','2020-5-25','2020-5-26','2020-5-27','2020-5-28','2020-5-1','2020-5-2','2020-5-3','2020-5-4','2020-5-5','2020-5-6','2020-5-7','2020-5-8','2020-5-9','2020-5-10','2020-5-11','2020-5-12','2020-5-13','2020-5-14','2020-5-15','2020-5-16','2020-5-17','2020-5-18','2020-5-19','2020-5-20','2020-5-21','2020-5-22','2020-5-23','2020-5-24','2020-5-25','2020-5-26','2020-5-27','2020-5-28',],'unit':[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],'sold':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28]})

df['dt'] = pd.to_datetime(df['dt'])

           dt  unit  sold 
0  2020-05-01     1     1
1  2020-05-02     1     2
2  2020-05-03     1     3
3  2020-05-04     1     4
4  2020-05-05     1     5
5  2020-05-06     1     6
...

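How should I go about this? I've seen:

This explains how to group by columns. I figure I could make a "day of week" column, but I'd still have the same problem of wanting to limit the calculation to the past 3 matching day-of-week values rather than all results.

This might be related, but it looks more useful for a one-off analysis than for creating a new column: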

First, create a new column from the dates

import pandas as pd

date = pd.date_range('2018-12-30', '2019-01-07',
                     freq='D').to_series()
# Day of week as an integer: Monday=0 through Sunday=6
date.dt.dayofweek

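This gives you the day-of-week number for each date; after that you just need to filter by month and sort the values.

This should work: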

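df['dayofweek'] = df['dt'].dt.dayofweek
# Average (not sum) the last 3 earlier rows with the same unit and weekday; assumes rows are in date order.
same = lambda x: (df.index < x.name) & (df['unit'] == x['unit']) & (df['dayofweek'] == x['dayofweek'])
df['output'] = df.apply(lambda x: df.loc[same(x), 'sold'].tail(3).mean(), axis=1)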

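Here's an idea: first group by unit, then within each unit group by weekday, and take a rolling mean over n weeks (closed='left' means the window covers the previous n values and excludes the current one, which seems to be what you want)...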

n = 3
result = (df.groupby('unit')
          .apply(lambda f: (f['sold']
                            .groupby(f.dt.dt.day_name())
                            .rolling(n, closed='left')
                            .mean()
                           )
                )
          )

...which results in this series:

unit  dt           
1     Friday     0      NaN
                 7      NaN
                 14     NaN
                 21     8.0
      Monday     3      NaN
                 10     NaN
                 17     NaN
                 24    11.0
      ...
2     Friday     28     NaN
                 35     NaN
                 42     NaN
                 49     8.0
      Monday     31     NaN
                 38     NaN
                 45     NaN
                 52    11.0
      ...
Name: sold, dtype: float64

Next, drop the unit and dt index levels; we don't need them. Also, rename the series to make joining easier.

result = result.reset_index(level=[0, 1], drop=True)
result = result.rename('prior_3_avg')
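To see why dropping those two levels is enough for the join that follows, here is a minimal sketch on a made-up three-row series (the name toy and its values are purely illustrative, not taken from the data above). After the reset, only the original row labels remain, and those are what a join against df lines up on.

```python
import pandas as pd

# Hypothetical toy series with three index levels:
# unit, weekday name, and the original row label.
idx = pd.MultiIndex.from_tuples(
    [(1, 'Friday', 0), (1, 'Friday', 7), (2, 'Friday', 28)],
    names=['unit', 'dt', None],
)
toy = pd.Series([1.0, 2.0, 3.0], index=idx, name='sold')

# Drop the first two levels, keeping only the original row labels,
# then rename so the joined column gets a meaningful name.
flat = toy.reset_index(level=[0, 1], drop=True).rename('prior_3_avg')
print(flat.index.tolist())
```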

Back to the mothership...

df2 = df.join(result)

Part of the final result in df2:

           dt  unit  sold  prior_3_avg
... # first 21 are NaN
21 2020-05-22     1    22          8.0
22 2020-05-23     1    23          9.0
23 2020-05-24     1    24         10.0
24 2020-05-25     1    25         11.0
25 2020-05-26     1    26         12.0
26 2020-05-27     1    27         13.0
27 2020-05-28     1    28         14.0
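
As a cross-check on the table above, the same numbers can probably be reproduced in one step: group on unit and weekday together, shift each group by one row so the current day is excluded, then take a rolling mean of 3. This is only a sketch and not the approach used in the answer above. It assumes rows are already sorted by dt within each unit, and both the column name prior_3_avg_alt and the compact way of rebuilding the sample frame are made up for illustration.

```python
import pandas as pd

# Rebuild a frame shaped like the sample: 28 days for each of two units.
days = pd.date_range('2020-05-01', '2020-05-28', freq='D')
df = pd.DataFrame({
    'dt': list(days) * 2,
    'unit': [1] * 28 + [2] * 28,
    'sold': list(range(1, 29)) * 2,
})

# Group by unit and weekday. shift() removes the current row from its own
# window, and rolling(3) needs three prior same-weekday values; rows with
# fewer than three come out as NaN.
weekday = df['dt'].dt.dayofweek
df['prior_3_avg_alt'] = (
    df.groupby(['unit', weekday])['sold']
      .transform(lambda s: s.shift().rolling(3).mean())
)

print(df.loc[21:27, ['dt', 'unit', 'sold', 'prior_3_avg_alt']])
```

Because transform returns a result aligned to the original index, no index cleanup or join is needed with this variant.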