从 pandas 中的多索引列的一部分创建列

Making a column out of a part of an multiindex column in pandas

我有一个df:

df_test = pd.DataFrame.from_dict({('group', ''): {0: 'A',
  1: 'A',
  2: 'A',
  3: 'A',
  4: 'A',
  5: 'A',
  6: 'A',
  7: 'A',
  8: 'A',
  9: 'B',
  10: 'B',
  11: 'B',
  12: 'B',
  13: 'B',
  14: 'B',
  15: 'B',
  16: 'B',
  17: 'B',
  18: 'all',
  19: 'all'},
 ('category', ''): {0: 'Amazon',
  1: 'Apple',
  2: 'Facebook',
  3: 'Google',
  4: 'Netflix',
  5: 'Tesla',
  6: 'Total',
  7: 'Uber',
  8: 'total',
  9: 'Amazon',
  10: 'Apple',
  11: 'Facebook',
  12: 'Google',
  13: 'Netflix',
  14: 'Tesla',
  15: 'Total',
  16: 'Uber',
  17: 'total',
  18: 'Total',
  19: 'total'},
 (pd.Timestamp('2020-06-29'), 'last_sales'): {0: 195.0,
  1: 61.0,
  2: 106.0,
  3: 61.0,
  4: 37.0,
  5: 13.0,
  6: 954.0,
  7: 4.0,
  8: 477.0,
  9: 50.0,
  10: 50.0,
  11: 75.0,
  12: 43.0,
  13: 17.0,
  14: 14.0,
  15: 504.0,
  16: 3.0,
  17: 252.0,
  18: 2916.0,
  19: 2916.0},
 (pd.Timestamp('2020-06-29'), 'sales'): {0: 1268.85,
  1: 18274.385000000002,
  2: 19722.65,
  3: 55547.255,
  4: 15323.800000000001,
  5: 1688.6749999999997,
  6: 227463.23,
  7: 1906.0,
  8: 113731.615,
  9: 3219.6499999999996,
  10: 15852.060000000001,
  11: 17743.7,
  12: 37795.15,
  13: 5918.5,
  14: 1708.75,
  15: 166349.64,
  16: 937.01,
  17: 83174.82,
  18: 787625.7400000001,
  19: 787625.7400000001},
 (pd.Timestamp('2020-06-29'), 'difference'): {0: 0.0,
  1: 0.0,
  2: 0.0,
  3: 0.0,
  4: 0.0,
  5: 0.0,
  6: 0.0,
  7: 0.0,
  8: 0.0,
  9: 0.0,
  10: 0.0,
  11: 0.0,
  12: 0.0,
  13: 0.0,
  14: 0.0,
  15: 0.0,
  16: 0.0,
  17: 0.0,
  18: 0.0,
  19: 0.0},
 (pd.Timestamp('2020-07-06'), 'last_sales'): {0: 26.0,
  1: 39.0,
  2: 79.0,
  3: 49.0,
  4: 10.0,
  5: 10.0,
  6: 436.0,
  7: 5.0,
  8: 218.0,
  9: 89.0,
  10: 34.0,
  11: 133.0,
  12: 66.0,
  13: 21.0,
  14: 20.0,
  15: 732.0,
  16: 3.0,
  17: 366.0,
  18: 2336.0,
  19: 2336.0},
 (pd.Timestamp('2020-07-06'), 'sales'): {0: 3978.15,
  1: 12138.96,
  2: 19084.175,
  3: 40033.46000000001,
  4: 4280.15,
  5: 1495.1,
  6: 165548.29,
  7: 1764.15,
  8: 82774.145,
  9: 8314.92,
  10: 12776.649999999996,
  11: 28048.075,
  12: 55104.21000000002,
  13: 6962.844999999999,
  14: 3053.2000000000003,
  15: 231049.11000000002,
  16: 1264.655,
  17: 115524.55500000001,
  18: 793194.8000000002,
  19: 793194.8000000002},
 (pd.Timestamp('2020-07-06'), 'difference'): {0: 0.0,
  1: 0.0,
  2: 0.0,
  3: 0.0,
  4: 0.0,
  5: 0.0,
  6: 0.0,
  7: 0.0,
  8: 0.0,
  9: 0.0,
  10: 0.0,
  11: 0.0,
  12: 0.0,
  13: 0.0,
  14: 0.0,
  15: 0.0,
  16: 0.0,
  17: 0.0,
  18: 0.0,
  19: 0.0},
 (pd.Timestamp('2021-06-28'), 'last_sales'): {0: 96.0,
  1: 56.0,
  2: 106.0,
  3: 44.0,
  4: 34.0,
  5: 13.0,
  6: 716.0,
  7: 9.0,
  8: 358.0,
  9: 101.0,
  10: 22.0,
  11: 120.0,
  12: 40.0,
  13: 13.0,
  14: 8.0,
  15: 610.0,
  16: 1.0,
  17: 305.0,
  18: 2652.0,
  19: 2652.0},
 (pd.Timestamp('2021-06-28'), 'sales'): {0: 5194.95,
  1: 19102.219999999994,
  2: 22796.420000000002,
  3: 30853.115,
  4: 11461.25,
  5: 992.6,
  6: 188143.41,
  7: 3671.15,
  8: 94071.705,
  9: 6022.299999999998,
  10: 7373.6,
  11: 33514.0,
  12: 35943.45,
  13: 4749.000000000001,
  14: 902.01,
  15: 177707.32,
  16: 349.3,
  17: 88853.66,
  18: 731701.46,
  19: 731701.46},
 (pd.Timestamp('2021-06-28'), 'difference'): {0: 0.0,
  1: 0.0,
  2: 0.0,
  3: 0.0,
  4: 0.0,
  5: 0.0,
  6: 0.0,
  7: 0.0,
  8: 0.0,
  9: 0.0,
  10: 0.0,
  11: 0.0,
  12: 0.0,
  13: 0.0,
  14: 0.0,
  15: 0.0,
  16: 0.0,
  17: 0.0,
  18: 0.0,
  19: 0.0},
 (pd.Timestamp('2021-07-07'), 'last_sales'): {0: 45.0,
  1: 47.0,
  2: 87.0,
  3: 45.0,
  4: 13.0,
  5: 8.0,
  6: 494.0,
  7: 2.0,
  8: 247.0,
  9: 81.0,
  10: 36.0,
  11: 143.0,
  12: 56.0,
  13: 9.0,
  14: 9.0,
  15: 670.0,
  16: 1.0,
  17: 335.0,
  18: 2328.0,
  19: 2328.0},
 (pd.Timestamp('2021-07-07'), 'sales'): {0: 7556.414999999998,
  1: 14985.05,
  2: 16790.899999999998,
  3: 36202.729999999996,
  4: 4024.97,
  5: 1034.45,
  6: 163960.32999999996,
  7: 1385.65,
  8: 81980.16499999998,
  9: 5600.544999999999,
  10: 11209.92,
  11: 32832.61,
  12: 42137.44500000001,
  13: 3885.1499999999996,
  14: 1191.5,
  15: 194912.34000000003,
  16: 599.0,
  17: 97456.17000000001,
  18: 717745.3400000001,
  19: 717745.3400000001},
 (pd.Timestamp('2021-07-07'), 'difference'): {0: 0.0,
  1: 0.0,
  2: 0.0,
  3: 0.0,
  4: 0.0,
  5: 0.0,
  6: 0.0,
  7: 0.0,
  8: 0.0,
  9: 0.0,
  10: 0.0,
  11: 0.0,
  12: 0.0,
  13: 0.0,
  14: 0.0,
  15: 0.0,
  16: 0.0,
  17: 0.0,
  18: 0.0,
  19: 0.0}}).set_index(['group','category'])

遵循以下结构:

                    2020-06-29                                       2020-07-05

group   category    last_sales      sales       difference
A       Amazon      195             1268        0    
        Apple       61              18247       0
        Facebook    106             19722       0
        ...

B       Amazon      50              3219        0
        Apple       50              15852       0
        Facebook    75              17743       0
        ... 

我正在尝试将 sales, last_sales, difference 移到类别旁边。所以目前这些是每周的专栏,我试图让它们成为某种索引,其中 df 看起来像这样:

                                        2020-06-29            2020-07-05

group   category    
A       Amazon     last_sales           195                   ...  
                   sales                1268                  ...  
                   difference           0                     ...
        Apple      last_sales           61                    ...
                   sales                18247                 ...
                   difference           0                     ...
        Facebook   last_sales           106                   ...
                   sales                19722                 ...
                   difference           0                     ...
        ...

B       Amazon     last_sales           50                    ...
                   sales                3219                  ...
                   difference           0                     ...
        Apple      last_sales           50                    ...
                   sales                15852                 ...
                   difference           0                     ...
        Facebook   last_sales           75                    ...
                   sales                17743                 ...
                   difference           0                     ...
        ... 

我一直在尝试使用 .unstack(), .reset_index() 来实现这一点,但没有成功,我不确定应该如何处理这个问题。 我无法解决的问题来自以下事实:

df_test.columns

Returns:

MultiIndex([(2020-06-29 00:00:00, 'last_sales'),
            (2020-06-29 00:00:00,      'sales'),
            (2020-06-29 00:00:00, 'difference'),
            (2020-07-06 00:00:00, 'last_sales'),
            (2020-07-06 00:00:00,      'sales'),
            (2020-07-06 00:00:00, 'difference'),
            (2021-06-28 00:00:00, 'last_sales'),
            (2021-06-28 00:00:00,      'sales'),
            (2021-06-28 00:00:00, 'difference'),
            (2021-07-07 00:00:00, 'last_sales'),
            (2021-07-07 00:00:00,      'sales'),
            (2021-07-07 00:00:00, 'difference')],
           )

所以一方面,日期应该保留为一列,另一方面,sales, last_sales and difference 应该成为一个新列(或索引,我不确定,但我希望我想要的输出能解释它).

如果原始顺序不重要,请使用 DataFrame.stack:

df = df_test.stack()
print (df.head())
                           2020-06-29 00:00:00  2020-07-06 00:00:00  \
group category                                                        
A     Amazon   difference                 0.00                 0.00   
               last_sales               195.00                26.00   
               sales                   1268.85              3978.15   
      Apple    difference                 0.00                 0.00   
               last_sales                61.00                39.00   

                           2021-06-28 00:00:00  2021-07-07 00:00:00  
group category                                                       
A     Amazon   difference                 0.00                0.000  
               last_sales                96.00               45.000  
               sales                   5194.95             7556.415  
      Apple    difference                 0.00                0.000  
               last_sales                56.00               47.000  

原始订购使用:

#because sample data
#df_test.columns = df_test.columns.remove_unused_levels()

cats = pd.CategoricalIndex(df_test.columns.levels[1], 
                      ordered=True, 
                      categories=['last_sales','sales','difference'])

df_test.columns = df_test.columns.set_levels(cats, level=1)

df = df_test.stack()
print (df.head())
                           2020-06-29 00:00:00  2020-07-06 00:00:00  \
group category                                                        
A     Amazon   last_sales              195.000                26.00   
               sales                  1268.850              3978.15   
               difference                0.000                 0.00   
      Apple    last_sales               61.000                39.00   
               sales                 18274.385             12138.96   

                           2021-06-28 00:00:00  2021-07-07 00:00:00  
group category                                                       
A     Amazon   last_sales                96.00               45.000  
               sales                   5194.95             7556.415  
               difference                 0.00                0.000  
      Apple    last_sales                56.00               47.000  
               sales                  19102.22            14985.050