Rolling average and aggregating more than one column in pandas

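How can I aggregate the 'reviewer' lists alongside the rolling mean of 'quantity'? For the DataFrame shown below, I can successfully compute the mean quantity per group over a rolling window of 3 rows (one row per recorded year). How can I add an extra column that collects the values of the 'reviewer' column over each window? For example, for company 'A' in 1993, the column would be [[p1, p2], [p3, p2], [p4]].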

import pandas as pd

df = pd.DataFrame(data=[
    ['A', 1990, 2, ['p1', 'p2']],
    ['A', 1991, 3, ['p3', 'p2']],
    ['A', 1993, 5, ['p4']],
    ['A', 2000, 4, ['p1', 'p5', 'p7']],
    ['B', 2000, 1, ['p3']],
    ['B', 2001, 2, ['p6', 'p9']],
    ['B', 2002, 3, ['p10', 'p1']]], columns=['company', 'year', 'quantity', 'reviewer'])

# Rolling mean of quantity over 3 rows within each company
df['rolling_average'] = (df.groupby('company')['quantity']
    .rolling(3).mean().reset_index(level=0, drop=True))

The current output is:

| index | company | year | quantity | reviewer | rolling_average |
| :---- | :------ | :--- | :------- | :------- | :-------------- |
| 0     | A       | 1990 | 2        | [p1, p2] | NaN             |
| 1     | A       | 1991 | 3        | [p3, p2] | NaN             |
| 2     | A       | 1993 | 5        | [p4]     | 3.33            |
| 3     | A       | 2000 | 4        | [p1, p5, p7] | 4.00        |
| 4     | B       | 2000 | 1        | [p3]     | NaN             |
| 5     | B       | 2001 | 2        | [p6, p9] | NaN             |
| 6     | B       | 2002 | 3        | [p10, p1]| 2.00            |

Because rolling only works on numeric values, you need a custom rolling here:

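import numpy as np

n = 3  # window size in rows
# Per company: the last n reviewer lists for each row, NaN until the window is full
df['new'] = (df.groupby('company')['reviewer']
    .apply(lambda x: [x.iloc[y-n:y].tolist() if y >= n else np.nan for y in range(1, len(x) + 1)])
    .explode().values)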
df
  company  year  quantity      reviewer                             new
0       A  1990         2      [p1, p2]                             NaN
1       A  1991         3      [p3, p2]                             NaN
2       A  1993         5          [p4]      [[p1, p2], [p3, p2], [p4]]
3       A  2000         4  [p1, p5, p7]  [[p3, p2], [p4], [p1, p5, p7]]
4       B  2000         1          [p3]                             NaN
5       B  2001         2      [p6, p9]                             NaN
6       B  2002         3     [p10, p1]     [[p3], [p6, p9], [p10, p1]]
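
Note that assigning with `.values` relies on the rows of `df` already being ordered by company, since the result is written back by position rather than by index. If that is not guaranteed, a variant that keeps the original index may be safer. The following is only a sketch under that assumption; the helper name `rolling_lists` and the column name `reviewer_window` are made up for illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=[
        ['A', 1990, 2, ['p1', 'p2']],
        ['A', 1991, 3, ['p3', 'p2']],
        ['A', 1993, 5, ['p4']],
        ['A', 2000, 4, ['p1', 'p5', 'p7']],
        ['B', 2000, 1, ['p3']],
        ['B', 2001, 2, ['p6', 'p9']],
        ['B', 2002, 3, ['p10', 'p1']],
    ],
    columns=['company', 'year', 'quantity', 'reviewer'],
)


def rolling_lists(s, n):
    """Hypothetical helper: for each row, the list of the last n values of s.

    Rows whose window is not yet full get NaN, mirroring rolling(n).
    The result keeps the index of s so it can be aligned back to df.
    """
    values = s.tolist()
    windows = [
        values[i - n + 1:i + 1] if i >= n - 1 else np.nan
        for i in range(len(values))
    ]
    return pd.Series(windows, index=s.index, dtype=object)


n = 3  # window size in rows

# Numeric column: the built-in grouped rolling mean, realigned to df's index
df['rolling_average'] = (
    df.groupby('company')['quantity']
    .rolling(n).mean()
    .reset_index(level=0, drop=True)
)

# List column: build the windows group by group, then align by index
parts = [rolling_lists(group, n) for _, group in df.groupby('company')['reviewer']]
df['reviewer_window'] = pd.concat(parts)
```

Because both new columns are assigned as index-aligned Series, this version should give the same result even if the rows of `df` are interleaved across companies.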