Sum of year to date in a multiindex dataframe
I have a multiindex dataframe df:
2021-06-28 2021-07-05
total_orders total_sales last_year_sales last_year_total_orders total_orders total_sales last_year_sales last_year_total_orders
group category
A Amazon 195.000 1,268.850 5,194.950 195.000 ... ... ... ...
Netflix 37.000 15,323.800 11,461.250 20.00
Apple 61.000 18,274.385 19,102.220 30.00
Facebook 106.000 19,722.650 22,796.420 50.00
Tesla 13.000 1,688.675 992.600 25.00
Uber 4.000 1,906.000 3,671.150 15.00
Google 61.000 55,547.255 30,853.115 10.00
total 477.000 113,731.615 94,071.705 56.00
B Amazon 50.000 3,219.650 6,022.300 400.00
Netflix 17.000 5,918.500 4,749.000 30.00
Apple 50.000 15,852.060 7,373.600 27.00
Facebook 75.000 17,743.700 33,514.000 15.00
Tesla 14.000 1,708.750 902.010 66.00
Uber 3.000 937.010 349.300 94.00
Google 43.000 37,795.150 35,943.450 65.00
total 252.000 83,174.820 88,853.660 61.00
I am trying to create a year_to_date column whose value is the sum of total_sales from 2021-06-28 up to today. As you can see above, I have 4 columns for every week, running from 2020-06-29 to the current week 2021-10-11. So for every week from 2021-06-28 onward I want a column that holds the cumulative sum of sales over all preceding weeks, starting at 2021-06-28.
I would like the result to look like this (I removed some columns for better readability):
2021-06-28 2021-07-05 2021-07-12
total_orders total_sales total_orders total_sales year_to_date_sales total_orders total_sales year_to_date_sales
group category
A Amazon 195.000 1,268.850 ... 1000 1,268.850 + 1000 ... 5000 1,268.850 + 1000 + 5000
Netflix 37.000 15,323.800
Apple 61.000 18,274.385 ... 2000 15,323.800 + 2000 ... 6000 1,268.850 + 1000 + 6000
Facebook 106.000 19,722.650
Tesla 13.000 1,688.675
Uber 4.000 1,906.000
Google 61.000 55,547.255
total 477.000 113,731.615
B Amazon 50.000 3,219.650
Netflix 17.000 5,918.500
Apple 50.000 15,852.060
Facebook 75.000 17,743.700
Tesla 14.000 1,708.750
Uber 3.000 937.010
Google 43.000 37,795.150
total 252.000 83,174.820
I have tried:
# Adding year to date sales
s = df.loc[:, (['2021-06-28','2021-10-11'], 'total_sales')]
s = np.sum(s, axis = 1)
s = s.rename({'total_sales': 'year_to_date_sales'}, axis=1, level=1)
df = df.combine_first(s)
# I tried ['2021-06-28':'2021-10-11'] to select all columns from - to but it does not let me
I assumed I would use np.sum(s, axis = 1) to sum row-wise, so that I get one result per group and category. But at the moment I cannot select the whole range of interest, and my approach produces the same value for every week instead of a cumulative sum that grows with each successive week from 2021-06-28 onward.
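For illustration, the difference between a plain row-wise sum and a cumulative sum can be seen on a tiny frame (the week labels below are invented for the sketch, they are not part of the original data):
import pandas as pd
# One row per (group, category), one column per week -- labels are made up.
toy = pd.DataFrame({'w1': [1.0, 2.0], 'w2': [10.0, 20.0], 'w3': [100.0, 200.0]})
# sum(axis=1) collapses all weeks into a single number per row, which is why
# every week would end up with the same "year to date" value.
print(toy.sum(axis=1))      # 111.0 and 222.0
# cumsum(axis=1) keeps one column per week and accumulates left to right,
# which is the shape a per-week year-to-date column needs.
print(toy.cumsum(axis=1))   # first row: w1 -> 1, w2 -> 11, w3 -> 111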
How can I achieve this result?
UPDATE
After using
df1 = (df.loc[:, (['2021-06-28','2021-10-11'], 'total_sales')]
.rename({'total_sales': 'year_to_date_sales'}, axis=1, level=1))
df = df.join(df1.cumsum(axis=1)).sort_index(axis=1)
for some reason the new columns were not added:
df.loc[:,'2021-07-05'].columns
MultiIndex([('2021-07-05', 'total_orders'),
('2021-07-05', 'total_sales'),
('2021-07-05', 'last_year_sales'),
('2021-07-05', 'last_year_total_orders')],
           names=['created_at', None])
When I check df1, it only keeps a single column, the one for 2021-06-28, while I expected one column per week. In the final df the new column therefore only exists under 2021-06-28 and not under any of the later dates.
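One thing worth noting (an observation about the selector, not necessarily the whole explanation): (['2021-06-28','2021-10-11'], 'total_sales') is a list of two exact labels, so only the dates spelled out in it are selected, whereas a slice built with pd.IndexSlice covers every week in between. A toy sketch with invented labels:
import pandas as pd
# Toy frame with the same (week, metric) column layout -- labels invented.
cols = pd.MultiIndex.from_product(
    [['2021-06-28', '2021-07-05', '2021-07-12'], ['total_orders', 'total_sales']])
toy = pd.DataFrame([[1, 10.0, 2, 20.0, 3, 30.0]], columns=cols)
# A tuple containing a list picks only the labels spelled out in the list;
# the weeks in between are skipped entirely.
print(toy.loc[:, (['2021-06-28', '2021-07-12'], 'total_sales')])
# An IndexSlice slice picks every week from the first bound to the last,
# which is the input the year-to-date cumsum needs.
idx = pd.IndexSlice
print(toy.loc[:, idx['2021-06-28':'2021-07-12', 'total_sales']])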
df data
import pandas as pd

df = pd.DataFrame.from_dict({('group', ''): {0: 'A',
1: 'A',
2: 'A',
3: 'A',
4: 'A',
5: 'A',
6: 'A',
7: 'A',
8: 'A',
9: 'B',
10: 'B',
11: 'B',
12: 'B',
13: 'B',
14: 'B',
15: 'B',
16: 'B',
17: 'B',
18: 'all',
19: 'all'},
('category', ''): {0: 'Amazon',
1: 'Apple',
2: 'Facebook',
3: 'Google',
4: 'Netflix',
5: 'Tesla',
6: 'Total',
7: 'Uber',
8: 'total',
9: 'Amazon',
10: 'Apple',
11: 'Facebook',
12: 'Google',
13: 'Netflix',
14: 'Tesla',
15: 'Total',
16: 'Uber',
17: 'total',
18: 'Total',
19: 'total'},
(pd.Timestamp('2020-06-29 00:00:00'), 'last_sales'): {0: 195.0,
1: 61.0,
2: 106.0,
3: 61.0,
4: 37.0,
5: 13.0,
6: 954.0,
7: 4.0,
8: 477.0,
9: 50.0,
10: 50.0,
11: 75.0,
12: 43.0,
13: 17.0,
14: 14.0,
15: 504.0,
16: 3.0,
17: 252.0,
18: 2916.0,
19: 2916.0},
(pd.Timestamp('2020-06-29 00:00:00'), 'total_sales'): {0: 1268.85,
1: 18274.385000000002,
2: 19722.65,
3: 55547.255,
4: 15323.800000000001,
5: 1688.6749999999997,
6: 227463.23,
7: 1906.0,
8: 113731.615,
9: 3219.6499999999996,
10: 15852.060000000001,
11: 17743.7,
12: 37795.15,
13: 5918.5,
14: 1708.75,
15: 166349.64,
16: 937.01,
17: 83174.82,
18: 787625.7400000001,
19: 787625.7400000001},
(pd.Timestamp('2020-06-29 00:00:00'), 'difference'): {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0},
(pd.Timestamp('2020-07-06 00:00:00'), 'last_sales'): {0: 26.0,
1: 39.0,
2: 79.0,
3: 49.0,
4: 10.0,
5: 10.0,
6: 436.0,
7: 5.0,
8: 218.0,
9: 89.0,
10: 34.0,
11: 133.0,
12: 66.0,
13: 21.0,
14: 20.0,
15: 732.0,
16: 3.0,
17: 366.0,
18: 2336.0,
19: 2336.0},
(pd.Timestamp('2020-07-06 00:00:00'), 'total_sales'): {0: 3978.15,
1: 12138.96,
2: 19084.175,
3: 40033.46000000001,
4: 4280.15,
5: 1495.1,
6: 165548.29,
7: 1764.15,
8: 82774.145,
9: 8314.92,
10: 12776.649999999996,
11: 28048.075,
12: 55104.21000000002,
13: 6962.844999999999,
14: 3053.2000000000003,
15: 231049.11000000002,
16: 1264.655,
17: 115524.55500000001,
18: 793194.8000000002,
19: 793194.8000000002},
(pd.Timestamp('2020-07-06 00:00:00'), 'difference'): {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0},
(pd.Timestamp('2021-06-28 00:00:00'), 'last_sales'): {0: 96.0,
1: 56.0,
2: 106.0,
3: 44.0,
4: 34.0,
5: 13.0,
6: 716.0,
7: 9.0,
8: 358.0,
9: 101.0,
10: 22.0,
11: 120.0,
12: 40.0,
13: 13.0,
14: 8.0,
15: 610.0,
16: 1.0,
17: 305.0,
18: 2652.0,
19: 2652.0},
(pd.Timestamp('2021-06-28 00:00:00'), 'total_sales'): {0: 5194.95,
1: 19102.219999999994,
2: 22796.420000000002,
3: 30853.115,
4: 11461.25,
5: 992.6,
6: 188143.41,
7: 3671.15,
8: 94071.705,
9: 6022.299999999998,
10: 7373.6,
11: 33514.0,
12: 35943.45,
13: 4749.000000000001,
14: 902.01,
15: 177707.32,
16: 349.3,
17: 88853.66,
18: 731701.46,
19: 731701.46},
(pd.Timestamp('2021-06-28 00:00:00'), 'difference'): {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0},
(pd.Timestamp('2021-07-07 00:00:00'), 'last_sales'): {0: 45.0,
1: 47.0,
2: 87.0,
3: 45.0,
4: 13.0,
5: 8.0,
6: 494.0,
7: 2.0,
8: 247.0,
9: 81.0,
10: 36.0,
11: 143.0,
12: 56.0,
13: 9.0,
14: 9.0,
15: 670.0,
16: 1.0,
17: 335.0,
18: 2328.0,
19: 2328.0},
(pd.Timestamp('2021-07-07 00:00:00'), 'total_sales'): {0: 7556.414999999998,
1: 14985.05,
2: 16790.899999999998,
3: 36202.729999999996,
4: 4024.97,
5: 1034.45,
6: 163960.32999999996,
7: 1385.65,
8: 81980.16499999998,
9: 5600.544999999999,
10: 11209.92,
11: 32832.61,
12: 42137.44500000001,
13: 3885.1499999999996,
14: 1191.5,
15: 194912.34000000003,
16: 599.0,
17: 97456.17000000001,
18: 717745.3400000001,
19: 717745.3400000001},
(pd.Timestamp('2021-07-07 00:00:00'), 'difference'): {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0}}).set_index(['group','category'])
Use DataFrame.cumsum with axis=1, join the result back to the original columns of the MultiIndex, and sort the columns:
idx = pd.IndexSlice
# Slice every weekly total_sales column from 2021-06-28 onward and rename the
# second level so the new columns do not clash with the existing ones.
df1 = (df.loc[:, idx['2021-06-28':'2021-10-11', 'total_sales']]
         .rename({'total_sales': 'year_to_date_sales'}, axis=1, level=1))
# Accumulate week over week, join back to the original frame, and re-sort the
# column MultiIndex so each week's metrics stay grouped together.
df = df.join(df1.cumsum(axis=1)).sort_index(axis=1)
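If you want to check that a cumulative column was actually created for every week in the range, one way (a usage sketch against the sample df above) is to cross-section the new second-level label:
# All year_to_date_sales columns, one per week from 2021-06-28 onward.
ytd = df.xs('year_to_date_sales', axis=1, level=1)
print(ytd.head())
# With the sample data, the ('A', 'Amazon') row should read 5194.95 for
# 2021-06-28 and 5194.95 + 7556.415 for 2021-07-07.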