Subtracting values & calculating percentages in multiindex dataframe
I have a MultiIndex dataframe df:
import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({('group', ''): {0: 'A',
1: 'A',
2: 'A',
3: 'A',
4: 'A',
5: 'A',
6: 'A',
7: 'B',
8: 'B',
9: 'B',
10: 'B',
11: 'B',
12: 'B',
13: 'B'},
('category', ''): {0: 'Books',
1: 'Candy',
2: 'Pencil',
3: 'Table',
4: 'PC',
5: 'Printer',
6: 'Lamp',
7: 'Books',
8: 'Candy',
9: 'Pencil',
10: 'Table',
11: 'PC',
12: 'Printer',
13: 'Lamp'},
(pd.Timestamp('2021-06-28 00:00:00'),
'Sales_1'): {0: 9.937449997200002, 1: 30.71300000639998, 2: 58.81199999639999, 3: 25.661999978399994, 4: 3.657999996, 5: 12.0879999972, 6: 61.16600000040001, 7: 6.319439989199998, 8: 12.333119997600003, 9: 24.0544100028, 10: 24.384659998799997, 11: 1.9992000012000002, 12: 0.324, 13: 40.69122000000001},
(pd.Timestamp('2021-06-28 00:00:00'),
'Sales_2'): {0: 21.890370397789923, 1: 28.300470581874837, 2: 53.52039700062155, 3: 52.425508769690694, 4: 6.384936971649232, 5: 6.807138946302334, 6: 52.172, 7: 5.916852561, 8: 5.810764652, 9: 12.1243325, 10: 17.88071596, 11: 0.913782413, 12: 0.869207661, 13: 20.9447844},
(pd.Timestamp('2021-06-28 00:00:00'), 'last_week_sales'): {0: np.nan,
1: np.nan,
2: np.nan,
3: np.nan,
4: np.nan,
5: np.nan,
6: np.nan,
7: np.nan,
8: np.nan,
9: np.nan,
10: np.nan,
11: np.nan,
12: np.nan,
13: np.nan},
(pd.Timestamp('2021-06-28 00:00:00'), 'total_orders'): {0: 86.0,
1: 66.0,
2: 188.0,
3: 556.0,
4: 12.0,
5: 4.0,
6: 56.0,
7: 90.0,
8: 26.0,
9: 49.0,
10: 250.0,
11: 7.0,
12: 2.0,
13: 44.0},
(pd.Timestamp('2021-06-28 00:00:00'), 'total_sales'): {0: 4390.11,
1: 24825.059999999998,
2: 48592.39999999998,
3: 60629.77,
4: 831.22,
5: 1545.71,
6: 34584.99,
7: 5641.54,
8: 6798.75,
9: 13290.13,
10: 42692.68000000001,
11: 947.65,
12: 329.0,
13: 29889.65},
(pd.Timestamp('2021-07-05 00:00:00'),
'Sales_1'): {0: 13.690399997999998, 1: 38.723000005199985, 2: 72.4443400032, 3: 36.75802000560001, 4: 5.691999996, 5: 7.206999998399999, 6: 66.55265999039996, 7: 6.4613199911999954, 8: 12.845630001599998, 9: 26.032340003999998, 10: 30.1634600016, 11: 1.0203399996, 12: 1.4089999991999997, 13: 43.67116000320002},
(pd.Timestamp('2021-07-05 00:00:00'),
'Sales_2'): {0: 22.874363860953647, 1: 29.5726042895728, 2: 55.926190956481534, 3: 54.7820864335212, 4: 6.671946105284065, 5: 7.113126469779095, 6: 54.517, 7: 6.194107518, 8: 6.083562133, 9: 12.69221484, 10: 18.71872129, 11: 0.956574175, 12: 0.910216433, 13: 21.92632044},
(pd.Timestamp('2021-07-05 00:00:00'), 'last_week_sales'): {0: 4390.11,
1: 24825.059999999998,
2: 48592.39999999998,
3: 60629.77,
4: 831.22,
5: 1545.71,
6: 34584.99,
7: 5641.54,
8: 6798.75,
9: 13290.13,
10: 42692.68000000001,
11: 947.65,
12: 329.0,
13: 29889.65},
(pd.Timestamp('2021-07-05 00:00:00'), 'total_orders'): {0: 109.0,
1: 48.0,
2: 174.0,
3: 587.0,
4: 13.0,
5: 5.0,
6: 43.0,
7: 62.0,
8: 13.0,
9: 37.0,
10: 196.0,
11: 8.0,
12: 1.0,
13: 33.0},
(pd.Timestamp('2021-07-05 00:00:00'), 'total_sales'): {0: 3453.02,
1: 17868.730000000003,
2: 44707.82999999999,
3: 60558.97999999999,
4: 1261.0,
5: 1914.6000000000001,
6: 24146.09,
7: 6201.489999999999,
8: 5513.960000000001,
9: 9645.87,
10: 25086.785,
11: 663.0,
12: 448.61,
13: 26332.7}}).set_index(['group','category'])
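For each date, I am trying to get a column that is Sales_2*1000 - total_sales, and to calculate how the categories are split percentage-wise by total_sales, which would be each category's total_sales divided by the sum of that week's total_sales, times 100.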
What I tried:
df['diff'] = df.loc[:,(slice(None),'total_sales')] - df.loc[:,(slice(None),'Sales_2')]
But I get
ValueError: Wrong number of items passed 4, placement implies 1
because this tries to put 4 columns into 1 instead of producing one result column per date. For the percentage of total_sales per category and date, I tried:
df.loc[:,(slice(None),'total_sales')].groupby(level=['group','category']).apply(lambda x: 100 * x / x.sum())
But all the values become 100, so I'm not sure how I can get columns next to total_sales so that the result looks like this:
                2021-06-28 00:00:00             2021-07-05 00:00:00
                total_sales %_split difference  total_sales %_split difference
group category
A     Books       4,390.110      9%        ...    3,453.020     ...        ...
      Candy      24,825.060     11%        ...   17,868.730     ...        ...
      Pencil     48,592.400     10%        ...   44,707.830     ...        ...
      Table      60,629.770     40%        ...   60,558.980     ...        ...
      PC            831.220      3%        ...    1,261.000     ...        ...
      Printer     1,545.710      7%        ...    1,914.600     ...        ...
      Lamp       34,584.990     30%        ...   24,146.090     ...        ...
B     Books       5,641.540     ...        ...    6,201.490     ...        ...
      Candy       6,798.750     ...        ...    5,513.960     ...        ...
      Pencil     13,290.130     ...        ...    9,645.870     ...        ...
      Table      42,692.680     ...        ...   25,086.785     ...        ...
      PC            947.650     ...        ...      663.000     ...        ...
      Printer       329.000     ...        ...      448.610     ...        ...
      Lamp       29,889.650     ...        ...   26,332.700     ...        ...
difference is total_sales - Sales_2*1000. I only included 2 columns here for readability; in reality I need all the columns present in df plus the 2 additional columns for each date.
We can try:
s = df.stack(level=0)
s['diff'] = s.eval('total_sales - Sales_2 * 1000')
sales_per_group = s['total_sales'].groupby(level=[0, 2]).transform('sum')
s['split %'] = s['total_sales'] / sales_per_group * 100
s = s.stack(dropna=False).unstack([2, 3])
print(s)
2021-06-28 00:00:00 2021-07-05 00:00:00
Sales_1 Sales_2 last_week_sales total_orders total_sales diff split % Sales_1 Sales_2 last_week_sales total_orders total_sales diff split %
group category
A Books 9.93745 21.890370 NaN 86.0 4390.11 -17500.260398 2.502924 13.69040 22.874364 4390.11 109.0 3453.020 -19421.343861 2.243528
Candy 30.71300 28.300471 NaN 66.0 24825.06 -3475.410582 14.153458 38.72300 29.572604 24825.06 48.0 17868.730 -11703.874290 11.609838
Lamp 61.16600 52.172000 NaN 56.0 34584.99 -17587.010000 19.717865 66.55266 54.517000 34584.99 43.0 24146.090 -30370.910000 15.688422
PC 3.65800 6.384937 NaN 12.0 831.22 -5553.716972 0.473902 5.69200 6.671946 831.22 13.0 1261.000 -5410.946105 0.819309
Pencil 58.81200 53.520397 NaN 188.0 48592.40 -4927.997001 27.703880 72.44434 55.926191 48592.40 174.0 44707.830 -11218.360956 29.047987
Printer 12.08800 6.807139 NaN 4.0 1545.71 -5261.428946 0.881252 7.20700 7.113126 1545.71 5.0 1914.600 -5198.526470 1.243972
Table 25.66200 52.425509 NaN 556.0 60629.77 8204.261230 34.566719 36.75802 54.782086 60629.77 587.0 60558.980 5776.893566 39.346944
B Books 6.31944 5.916853 NaN 90.0 5641.54 -275.312561 5.664800 6.46132 6.194108 5641.54 62.0 6201.490 7.382482 8.392593
Candy 12.33312 5.810765 NaN 26.0 6798.75 987.985348 6.826781 12.84563 6.083562 6798.75 13.0 5513.960 -569.602133 7.462146
Lamp 40.69122 20.944784 NaN 44.0 29889.65 8944.865600 30.012883 43.67116 21.926320 29889.65 33.0 26332.700 4406.379560 35.636540
PC 1.99920 0.913782 NaN 7.0 947.65 33.867587 0.951557 1.02034 0.956574 947.65 8.0 663.000 -293.574175 0.897250
Pencil 24.05441 12.124332 NaN 49.0 13290.13 1165.797500 13.344924 26.03234 12.692215 13290.13 37.0 9645.870 -3046.344840 13.053938
Printer 0.32400 0.869208 NaN 2.0 329.00 -540.207661 0.330356 1.40900 0.910216 329.00 1.0 448.610 -461.606433 0.607112
Table 24.38466 17.880716 NaN 250.0 42692.68 24811.964040 42.868699 30.16346 18.718721 42692.68 196.0 25086.785 6368.063710 33.950420
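To make the reshaping easier to follow, here is a minimal sketch of the same stack / compute / unstack idea on a small made-up frame. The frame `toy` and its numbers are hypothetical (not the data from the question), and the last step uses `swaplevel` plus `sort_index` instead of the second `stack(dropna=False)` call, which is a substitution on my part rather than the code above.

```python
import pandas as pd

# Hypothetical toy data: two dates, two metrics per date.
dates = [pd.Timestamp("2021-06-28"), pd.Timestamp("2021-07-05")]
cols = pd.MultiIndex.from_product([dates, ["Sales_2", "total_sales"]])
idx = pd.MultiIndex.from_tuples(
    [("A", "Books"), ("A", "Candy"), ("B", "Books"), ("B", "Candy")],
    names=["group", "category"],
)
toy = pd.DataFrame(
    [[1.0, 100.0, 2.0, 300.0],
     [3.0, 300.0, 4.0, 100.0],
     [5.0, 200.0, 6.0, 500.0],
     [7.0, 600.0, 8.0, 500.0]],
    index=idx, columns=cols,
)

# Move the date level from the columns into the index (long format).
# The index is now (group, category, date); the columns are the metric names.
long_df = toy.stack(level=0)

# With one row per (group, category, date), both new columns are plain column arithmetic.
long_df["diff"] = long_df["total_sales"] - long_df["Sales_2"] * 1000
group_date_total = long_df.groupby(level=[0, 2])["total_sales"].transform("sum")
long_df["split %"] = long_df["total_sales"] / group_date_total * 100

# Back to wide format: dates return to the columns, then become the outer level.
wide = long_df.unstack(level=2).swaplevel(axis=1).sort_index(axis=1)
print(wide)
```

Grouping by levels 0 and 2 (group and date) is what makes each percentage relative to its own group and week.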
For the first problem, calculating the difference, this should do the trick. It takes a few lines of code but works for as many dates as you need. Just compute the differences as a numpy array and then put them in the corresponding columns.
tmp = 1000*df.loc[:,(slice(None),'Sales_2')].values - df.loc[:,(slice(None),'total_sales')].values
date_index = [i[0] for i in df.columns if 'total_sales' == i[1]]
for i,di in enumerate(date_index):
    df[(di,'diff')] = tmp[:,i]
df = df.reindex(sorted(df.columns), axis=1)
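As a quick check of the positional logic, here is a small sketch on made-up data. The frame `toy` is hypothetical, and `to_numpy()` is used in place of `.values`, which should behave the same here. Converting to arrays drops the column labels, so the subtraction is done column by column in date order instead of by label.

```python
import pandas as pd

# Hypothetical toy data: two dates, two metrics per date.
dates = [pd.Timestamp("2021-06-28"), pd.Timestamp("2021-07-05")]
cols = pd.MultiIndex.from_product([dates, ["Sales_2", "total_sales"]])
idx = pd.MultiIndex.from_tuples(
    [("A", "Books"), ("A", "Candy"), ("B", "Books"), ("B", "Candy")],
    names=["group", "category"],
)
toy = pd.DataFrame(
    [[1.0, 100.0, 2.0, 300.0],
     [3.0, 300.0, 4.0, 100.0],
     [5.0, 200.0, 6.0, 500.0],
     [7.0, 600.0, 8.0, 500.0]],
    index=idx, columns=cols,
)

# Each slice yields one column per date, in the same order, so the arrays line up.
sales2 = toy.loc[:, (slice(None), "Sales_2")].to_numpy()
total = toy.loc[:, (slice(None), "total_sales")].to_numpy()
diff = 1000 * sales2 - total  # shape: (rows, number of dates)

# Write column i of the array back under the matching date.
date_index = [date for date, name in toy.columns if name == "total_sales"]
for i, date in enumerate(date_index):
    toy[(date, "diff")] = diff[:, i]

# Sort the columns so each 'diff' sits next to the other columns of its date.
toy = toy.sort_index(axis=1)
print(toy)
```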
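Regarding the second question, about groupby: your code passes only one row at a time to the lambda function, so getting 100% is expected, because x equals x.sum().
If you want the percentage within each group (A, B), you should not group by category as well. This is enough: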
df.loc[:,(slice(None),'total_sales')].groupby(level=['group']).apply(lambda x: 100 * x / x.sum())
If you want the percentage regardless of group or category, just loop:
for i,di in enumerate(date_index):
    df[(di,'%_split')] = 100*df[(di,'total_sales')]/df[(di,'total_sales')].sum()
df = df.reindex(sorted(df.columns), axis=1)
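The three denominators discussed above can be compared side by side on a small made-up frame. This is only a sketch: `sales` and its column names are hypothetical, and it uses `transform('sum')` rather than `apply` with a lambda, which is my own substitution because it keeps the original index unchanged.

```python
import pandas as pd

# Hypothetical data: one total_sales column per week.
idx = pd.MultiIndex.from_tuples(
    [("A", "Books"), ("A", "Candy"), ("B", "Books"), ("B", "Candy")],
    names=["group", "category"],
)
sales = pd.DataFrame(
    {"week_1": [100.0, 300.0, 200.0, 600.0],
     "week_2": [300.0, 100.0, 500.0, 500.0]},
    index=idx,
)

# Grouping by both index levels: every group is a single row,
# so each value is divided by itself and the result is always 100.
by_row = 100 * sales / sales.groupby(level=["group", "category"]).transform("sum")

# Grouping by 'group' only: each value is divided by its group's weekly total.
by_group = 100 * sales / sales.groupby(level="group").transform("sum")

# No grouping: each value is divided by the weekly total over all rows.
overall = 100 * sales / sales.sum()

print(by_group)
```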
Here is a solution that builds on your original approach.
First compute a dataframe for the differences:
df_diff = pd.concat({'diff': (1000*df.loc[:,(slice(None),'Sales_2')].droplevel(axis=1, level=1)
                              -df.loc[:,(slice(None),'total_sales')].droplevel(axis=1, level=1)
                             )
                    }, axis=1).swaplevel(axis=1)
df_diff
Output:
               2021-06-28 00:00:00 2021-07-05 00:00:00
                              diff                diff
group category
A     Books           17500.260398        19421.343861
      Candy            3475.410582        11703.874290
      Pencil           4927.997001        11218.360956
      Table           -8204.261230        -5776.893566
      PC               5553.716972         5410.946105
      Printer          5261.428946         5198.526470
      Lamp            17587.010000        30370.910000
B     Books             275.312561           -7.382482
      Candy            -987.985348          569.602133
      Pencil          -1165.797500         3046.344840
      Table          -24811.964040        -6368.063710
      PC                -33.867587          293.574175
      Printer           540.207661          461.606433
      Lamp            -8944.865600        -4406.379560
Then compute a dataframe for the percentages:
df_percent = (df.loc[:,(slice(None),'total_sales')]
.groupby('group').apply(lambda x: 100*x/x.sum())
).rename({'total_sales': '%sales'}, axis=1, level=1)
df_percent
Output:
               2021-06-28 2021-07-05
                   %sales     %sales
group category
A     Books      2.502924   2.243528
      Candy     14.153458  11.609838
      Pencil    27.703880  29.047987
      Table     34.566719  39.346944
      PC         0.473902   0.819309
      Printer    0.881252   1.243972
      Lamp      19.717865  15.688422
B     Books      5.664800   8.392593
      Candy      6.826781   7.462146
      Pencil    13.344924  13.053938
      Table     42.868699  33.950420
      PC         0.951557   0.897250
      Printer    0.330356   0.607112
      Lamp      30.012883  35.636540
Finally, merge everything:
pd.concat([df, df_diff, df_percent], axis=1).sort_index(axis=1, level=0)
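For anyone wondering why `droplevel` is needed at all, the following sketch on made-up data shows what goes wrong without it. The frame `toy` is hypothetical. pandas aligns the two slices on their full column labels, and since `(date, 'Sales_2')` never matches `(date, 'total_sales')`, the result is the union of both label sets filled with NaN. I believe those are the 4 columns the ValueError in the question complains about.

```python
import pandas as pd

# Hypothetical toy data: two dates, two metrics per date.
dates = [pd.Timestamp("2021-06-28"), pd.Timestamp("2021-07-05")]
cols = pd.MultiIndex.from_product([dates, ["Sales_2", "total_sales"]])
idx = pd.MultiIndex.from_tuples(
    [("A", "Books"), ("A", "Candy"), ("B", "Books"), ("B", "Candy")],
    names=["group", "category"],
)
toy = pd.DataFrame(
    [[1.0, 100.0, 2.0, 300.0],
     [3.0, 300.0, 4.0, 100.0],
     [5.0, 200.0, 6.0, 500.0],
     [7.0, 600.0, 8.0, 500.0]],
    index=idx, columns=cols,
)

s2 = toy.loc[:, (slice(None), "Sales_2")]
ts = toy.loc[:, (slice(None), "total_sales")]

# Labels differ in level 1, so nothing lines up: 4 columns, all NaN.
misaligned = ts - s2

# Dropping level 1 leaves only the dates as labels, so the columns match.
aligned = 1000 * s2.droplevel(1, axis=1) - ts.droplevel(1, axis=1)

# Re-attach a level-1 name and put the date back as the outer level.
df_diff = pd.concat({"diff": aligned}, axis=1).swaplevel(axis=1)
out = pd.concat([toy, df_diff], axis=1).sort_index(axis=1)
print(out)
```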
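My solution is based on the following idea:
- group the source DataFrame by level 0 of the column index,
- for each group, generate a "partial" DataFrame with two new columns added,
- concatenate all partial results horizontally.
To generate the "partial" DataFrame for each group (step 2), define the following function: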
def addCols(grp):
    dff = (grp.loc[:,(slice(None),'Sales_2')] * 1000).values\
        - grp.loc[:,(slice(None),'total_sales')].values
    wrk = grp.loc[:,(slice(None), 'total_sales')]
    pct = (wrk * 100 / wrk.groupby(level=0).sum()).values
    dd = grp.columns[0][0]
    return grp.join(pd.DataFrame(np.hstack([pct, dff]), columns=pd.MultiIndex
        .from_tuples([(dd, 'Pct'), (dd, 'Diff')]), index=grp.index))
Then run:
result = pd.concat([ addCols(grp) for (_, grp) in df.groupby(axis=1, level=0) ], axis=1)
The result is too wide to include here, but you will see it when you run the code above.
Change the new column names as needed.
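Grouping along the columns with `groupby(axis=1, ...)` is deprecated in recent pandas releases (2.1 and later, as far as I know), so on a newer version the same three steps can be sketched with a plain loop over the dates. This is an alternative I am swapping in, not the function above, and `toy` with its numbers is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: two dates, two metrics per date.
dates = [pd.Timestamp("2021-06-28"), pd.Timestamp("2021-07-05")]
cols = pd.MultiIndex.from_product([dates, ["Sales_2", "total_sales"]])
idx = pd.MultiIndex.from_tuples(
    [("A", "Books"), ("A", "Candy"), ("B", "Books"), ("B", "Candy")],
    names=["group", "category"],
)
toy = pd.DataFrame(
    [[1.0, 100.0, 2.0, 300.0],
     [3.0, 300.0, 4.0, 100.0],
     [5.0, 200.0, 6.0, 500.0],
     [7.0, 600.0, 8.0, 500.0]],
    index=idx, columns=cols,
)

parts = {}
for date in toy.columns.get_level_values(0).unique():
    # Step 1: take the sub-frame for one date; its columns are the metric names.
    part = toy[date].copy()
    # Step 2: add the two new columns for this date.
    part["Diff"] = part["Sales_2"] * 1000 - part["total_sales"]
    group_total = part.groupby(level="group")["total_sales"].transform("sum")
    part["Pct"] = 100 * part["total_sales"] / group_total
    parts[date] = part

# Step 3: concatenate horizontally; the dict keys become the outer column level.
result = pd.concat(parts, axis=1)
print(result)
```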