Multiple nested groupby in pandas

Here is my pandas DataFrame:

df = pd.DataFrame({
    'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11',
             5: '2016-10-12', 6: '2016-10-12', 7: '2016-10-12', 8: '2016-10-12', 9: '2016-10-12'},
    'Stock': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H', 8: 'I', 9: 'J'},
    'Sector': {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0, 8: 1, 9: 1},
    'Segment': {0: 0, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 3, 9: 3},
    'Range': {0: 5, 1: 0, 2: 1, 3: 0, 4: 2, 5: 6, 6: 0, 7: 23, 8: 5, 9: 5}
})

Which looks like this:

I want to add the columns Date_Range_Avg, Date_Sector_Range_Avg and Date_Segment_Range_Avg:

This would be the output:

res = pd.DataFrame({
    'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11',
             5: '2016-10-12', 6: '2016-10-12', 7: '2016-10-12', 8: '2016-10-12', 9: '2016-10-12'},
    'Stock': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H', 8: 'I', 9: 'J'},
    'Sector': {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0, 8: 1, 9: 1},
    'Segment': {0: 0, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 3, 9: 3},
    'Range': {0: 5, 1: 0, 2: 1, 3: 0, 4: 2, 5: 6, 6: 0, 7: 23, 8: 5, 9: 5},
    'Date_Range_Avg': {0: 1.6, 1: 1.6, 2: 1.6, 3: 1.6, 4: 1.6, 5: 7.8, 6: 7.8, 7: 7.8, 8: 7.8, 9: 7.8},
    'Date_Sector_Range_Avg': {0: 2.5, 1: 2.5, 2: 1, 3: 1, 4: 1, 5: 9.67, 6: 9.67, 7: 9.67, 8: 9.67, 9: 9.67},
    'Date_Segment_Range_Avg': {0: 5, 1: 0.75, 2: 0.75, 3: 0.75, 4: 0.75, 5: 6, 6: 11.5, 7: 11.5, 8: 5, 9: 5}
})

Here is what that looks like:

Note that I've rounded some of the values above, but the rounding is not part of my question (please don't round in the answer).

I know I could do each of these groupbys separately, but that feels inefficient (my dataset contains millions of rows).

Essentially, I want to group by Date first, and then reuse that grouping to do two more granular groupbys: one on Date and Segment, and one on Date and Sector.

How can I do that?

My initial hunch was something like this:

day_groups = df.groupby("Date")
df['Date_Range_Avg'] = day_groups['Range'].transform('mean')

and then reuse day_groups to perform the two more granular groupbys, like so:

df['Date_Sector_Range_Avg'] = day_groups.groupby('Segment')['Range'].transform('mean')

This doesn't work, as you get:

AttributeError: 'DataFrameGroupBy' object has no attribute 'groupby'

groupby runs very fast when the aggregation function is vectorized. If you are worried about performance, try it out first and see whether it is a real bottleneck in your program.
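To illustrate that point, here is a small benchmark sketch on synthetic data (my own illustration, not the question's data): passing the string `'mean'` lets pandas take its vectorized path, while a Python lambda is called once per group.

```python
import numpy as np
import pandas as pd
from timeit import timeit

# Synthetic data with many groups, so the difference between a vectorized
# aggregation and a Python-level one becomes visible.
rng = np.random.default_rng(0)
big = pd.DataFrame({
    "Date": rng.integers(0, 20_000, size=200_000),
    "Range": rng.random(200_000),
})

# Vectorized: pandas dispatches 'mean' to an optimized implementation.
fast = timeit(lambda: big.groupby("Date")["Range"].transform("mean"), number=1)
# Python-level: the lambda is invoked once per group.
slow = timeit(lambda: big.groupby("Date")["Range"].transform(lambda s: s.mean()), number=1)
print(f"vectorized 'mean': {fast:.3f}s  python lambda: {slow:.3f}s")
```

Both produce identical results; only the string form stays in fast compiled code.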

You can create temporary dataframes holding the result of each groupby, then merge them back into df:

group_bys = {
    "Date_Range_Avg": ["Date"],
    "Date_Sector_Range_Avg": ["Date", "Sector"],
    "Date_Segment_Range_Avg": ["Date", "Segment"]
}

tmp = [
    df.groupby(columns)["Range"].mean().to_frame(key)
    for key, columns in group_bys.items()
]

result = df
for t in tmp:
    result = result.merge(t, left_on=t.index.names, right_index=True)

Result:

         Date Stock  Sector  Segment  Range  Date_Range_Avg  Date_Sector_Range_Avg  Date_Segment_Range_Avg
0  2016-10-11     A       0        0      5             1.6               2.500000                    5.00
1  2016-10-11     B       0        1      0             1.6               2.500000                    0.75
2  2016-10-11     C       1        1      1             1.6               1.000000                    0.75
3  2016-10-11     D       1        1      0             1.6               1.000000                    0.75
4  2016-10-11     E       1        1      2             1.6               1.000000                    0.75
5  2016-10-12     F       0        1      6             7.8               9.666667                    6.00
6  2016-10-12     G       0        2      0             7.8               9.666667                   11.50
7  2016-10-12     H       0        2     23             7.8               9.666667                   11.50
8  2016-10-12     I       1        3      5             7.8               5.000000                    5.00
9  2016-10-12     J       1        3      5             7.8               5.000000                    5.00
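One caveat with the merge approach: `merge` defaults to an inner join, so a row whose group key is missing (e.g. a NaN Sector) would be silently dropped, because `groupby` excludes NaN keys by default. A minimal sketch of the difference, using a toy frame of my own (not the question's data):

```python
import numpy as np
import pandas as pd

# Toy frame (my assumption): one row has a NaN group key.
df = pd.DataFrame({
    "Date": ["2016-10-11", "2016-10-11", "2016-10-11"],
    "Sector": [0, 0, np.nan],
    "Range": [5, 0, 1],
})

# groupby drops the NaN key, so t has no entry for that row.
t = df.groupby(["Date", "Sector"])["Range"].mean().to_frame("Date_Sector_Range_Avg")

inner = df.merge(t, left_on=t.index.names, right_index=True)             # NaN row dropped
left = df.merge(t, left_on=t.index.names, right_index=True, how="left")  # NaN row kept, avg is NaN
print(len(inner), len(left))
```

If your real data can contain missing keys, passing `how="left"` keeps those rows (with a NaN average) instead of losing them.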

Another option is to use transform, which avoids the multiple merges:

# reusing your code
group_bys = {
    "Date_Range_Avg": ["Date"],
    "Date_Sector_Range_Avg": ["Date", "Sector"],
    "Date_Segment_Range_Avg": ["Date", "Segment"]
}

tmp = {key: df.groupby(columns)["Range"].transform('mean')
       for key, columns in group_bys.items()}

df.assign(**tmp)
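Both approaches should agree on the question's data. A quick self-contained sanity check (my addition) that builds the frame, runs both versions, and compares them:

```python
import pandas as pd

# The question's data, rebuilt from lists for brevity.
df = pd.DataFrame({
    "Date": ["2016-10-11"] * 5 + ["2016-10-12"] * 5,
    "Stock": list("ABCDEFGHIJ"),
    "Sector": [0, 0, 1, 1, 1, 0, 0, 0, 1, 1],
    "Segment": [0, 1, 1, 1, 1, 1, 2, 2, 3, 3],
    "Range": [5, 0, 1, 0, 2, 6, 0, 23, 5, 5],
})

group_bys = {
    "Date_Range_Avg": ["Date"],
    "Date_Sector_Range_Avg": ["Date", "Sector"],
    "Date_Segment_Range_Avg": ["Date", "Segment"],
}

# transform-based: one aligned Series per key, assigned in one go
via_transform = df.assign(**{
    key: df.groupby(columns)["Range"].transform("mean")
    for key, columns in group_bys.items()
})

# merge-based: aggregate per group, then merge each result back in
via_merge = df
for key, columns in group_bys.items():
    t = df.groupby(columns)["Range"].mean().to_frame(key)
    via_merge = via_merge.merge(t, left_on=t.index.names, right_index=True)

print(via_transform.equals(via_merge))
```

On millions of rows the transform version also skips building and joining the intermediate frames, which is typically the cheaper of the two.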