Group by a pandas df and create a column with a nested dictionary

Given this df:

        dim_date_id closing_type    r_d variable    value   rolling cusum_sample    sample_type
1330    1995-10-27      low         1     low      9.699377  0.039688   1   [sh_dummy_0.5, sh_dummy_1]
1331    1995-10-27      low         1    close    10.340971  0.044784   1   [sh_dummy_0.5, sh_dummy_1]
1330    1995-10-27      high        1    high     10.529675  0.062868   1   [sh_dummy_0.5, sh_dummy_1, sh_dummy_2]
1331    1995-10-27      high        1    close    10.340971  0.044784   1   [sh_dummy_0.5, sh_dummy_1, sh_dummy_2]
1330    1995-10-27      low         5     low      9.699377  0.132976   1   [sh_dummy_0.5, sh_dummy_1, sh_dummy_2]
1331    1995-10-27      low         5   close     10.340971  0.188179   1   [sh_dummy_0.5, sh_dummy_1, sh_dummy_2]
1330    1995-10-27      high        5    high     10.529675  0.184475   1   [sh_dummy_0.5, sh_dummy_1, sh_dummy_2]

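I want to group it by variable and put the remaining fields into a nested dictionary in the sample_type column (or in some other column, I don't much care which). As output, I want a df that looks like this: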

       dim_date_id      variable   value      sample_type
1330    1995-10-27       low      9.699377     {'r_d':1,'closing_type':'low','rolling':0.039688,'sample':[sh_dummy_0.5, sh_dummy_1]},
                                           {'r_d':5,'closing_type':'low','rolling':0.132976,'sample':[sh_dummy_0.5, sh_dummy_1, sh_dummy_2]}

1331    1995-10-27      close    10.340971  {'r_d':1,'closing_type':'low','rolling':0.044784,'sample':[sh_dummy_0.5, sh_dummy_1]},
                                         {'r_d':1,'closing_type':'high','rolling':0.062868,'sample':[sh_dummy_0.5, sh_dummy_1, sh_dummy_2]},
                                         {'r_d':5,'closing_type':'low','rolling':0.188179,'sample':[sh_dummy_0.5, sh_dummy_1, sh_dummy_2]}

1330    1995-10-27      high     10.529675    {'r_d':1,'closing_type':'high','rolling':0.062868,'sample':[sh_dummy_0.5, sh_dummy_1, sh_dummy_2]},
                                           {'r_d':5,'closing_type':'high','rolling':0.184475,'sample':[sh_dummy_0.5, sh_dummy_1, sh_dummy_2]}

It needs to be as flexible as possible, because the sample_type column can sometimes hold 'n' different variables.

Try this:

new_df = df.groupby(['dim_date_id','variable','value']).apply(lambda x: x.to_dict()).reset_index(name='sample_type')

Output:

>>> new_df
  dim_date_id variable      value                                        sample_type
0  1995-10-27    close  10.340971  {'dim_date_id': {1331: '1995-10-27'}, 'closing...
1  1995-10-27     high  10.529675  {'dim_date_id': {1330: '1995-10-27'}, 'closing...
2  1995-10-27      low   9.699377  {'dim_date_id': {1330: '1995-10-27'}, 'closing...
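Note that x.to_dict() nests every column by index label, which is not quite the shape asked for in the question. If you want one list of flat dicts per group instead, something along these lines might get closer. This is only a sketch: the small frame below is a stand-in built from the first rows of the question, and the names keys, nested_cols and rows are my own, not anything from the original code. It also loops over the groups explicitly rather than relying on how apply treats the grouping columns.

```python
import pandas as pd

# Stand-in for the question's frame (first four rows only, index dropped).
df = pd.DataFrame({
    "dim_date_id": ["1995-10-27"] * 4,
    "closing_type": ["low", "low", "high", "high"],
    "r_d": [1, 1, 1, 1],
    "variable": ["low", "close", "high", "close"],
    "value": [9.699377, 10.340971, 10.529675, 10.340971],
    "rolling": [0.039688, 0.044784, 0.062868, 0.044784],
    "cusum_sample": [1, 1, 1, 1],
    "sample_type": [
        ["sh_dummy_0.5", "sh_dummy_1"],
        ["sh_dummy_0.5", "sh_dummy_1"],
        ["sh_dummy_0.5", "sh_dummy_1", "sh_dummy_2"],
        ["sh_dummy_0.5", "sh_dummy_1", "sh_dummy_2"],
    ],
})

keys = ["dim_date_id", "variable", "value"]                    # columns to group on
nested_cols = ["r_d", "closing_type", "rolling", "sample_type"]  # columns to nest

rows = []
# sort=False keeps the groups in order of first appearance.
for key_values, group in df.groupby(keys, sort=False):
    # One flat dict per original row; 'sample_type' is renamed to 'sample'
    # to match the layout sketched in the question.
    records = (
        group[nested_cols]
        .rename(columns={"sample_type": "sample"})
        .to_dict("records")
    )
    rows.append({**dict(zip(keys, key_values)), "sample_type": records})

new_df = pd.DataFrame(rows)
print(new_df)
```

Because nested_cols is just a list, adding or removing fields from the nested dicts is a one-line change, and the lists in sample can be any length.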