Convert a single df into multiple dfs by interchanging rows and columns and taking the sum of each column in pandas
I have the following pandas DataFrame:
Depts | Category | Monthly Booked | Monthly Delivered | Monthly Target | Yearly Booked | Yearly Delivered | Yearly Target |
---|---|---|---|---|---|---|---|
HR | Human | 2345 | 2000 | 3000 | 1234556 | 234543 | 6432212 |
Software | Engg | 654345 | 343213 | 765432 | 98765123 | 2345654 | 9999999 |
Security | Human | 1234 | 1234 | 2000 | 23456 | 34568 | 234567 |
Software | Engg | 12345 | 54334 | 324546 | 345645345 | 65345654 | 643563452 |
Software | Human | 12345 | 54334 | 324546 | 345645345 | 65345654 | 643563452 |
Security | Engg | 12345 | 54334 | 324546 | 34564534 | 65345654 | 643563452 |
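For anyone who wants to reproduce the question, the table above can be built as a DataFrame with a sketch like the one below. The variable name df is chosen to match the code used later in the thread; the values are copied from the table.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "Depts": ["HR", "Software", "Security", "Software", "Software", "Security"],
    "Category": ["Human", "Engg", "Human", "Engg", "Human", "Engg"],
    "Monthly Booked": [2345, 654345, 1234, 12345, 12345, 12345],
    "Monthly Delivered": [2000, 343213, 1234, 54334, 54334, 54334],
    "Monthly Target": [3000, 765432, 2000, 324546, 324546, 324546],
    "Yearly Booked": [1234556, 98765123, 23456, 345645345, 345645345, 34564534],
    "Yearly Delivered": [234543, 2345654, 34568, 65345654, 65345654, 65345654],
    "Yearly Target": [6432212, 9999999, 234567, 643563452, 643563452, 643563452],
})

print(df)
```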
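Now I want to turn the values of Depts into column headers and group by Category, then put the monthly and yearly sums into two separate tables, each with a total per metric for every column.
Like this: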
Monthly data
| Category | Metric | Software | Security | HR |
|---|---|---|---|---|
| Engg | Target | 1089978 | 324546 | |
| | Delivered | 397547 | 54334 | |
| | Booked | 666690 | 12345 | |
| Human | Target | 324546 | 2000 | 3000 |
| | Delivered | 54334 | 1234 | 2000 |
| | Booked | 12345 | 1234 | 2345 |
| Total | Target | 1414524 | 326546 | 3000 |
| | Delivered | 451881 | 55568 | 2000 |
| | Booked | 679035 | 13579 | 2345 |
Yearly data
| Category | Metric | Software | Security | HR |
|---|---|---|---|---|
| Engg | Target | 653563451 | 643563452 | |
| | Delivered | 67691308 | 65345654 | |
| | Booked | 444410468 | 34564534 | |
| Human | Target | 643563452 | 234567 | 6432212 |
| | Delivered | 65345654 | 34568 | 234543 |
| | Booked | 345645345 | 23456 | 1234556 |
| Total | Target | 1297126903 | 643798019 | 6432212 |
| | Delivered | 133036962 | 65380222 | 234543 |
| | Booked | 790055813 | 34587990 | 1234556 |
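Can I achieve this with pandas functions? If so, how?
Note: I also want to keep the grouping but turn the index into columns. That is, I want the index names to become column names while the grouping is kept in the first two columns.
My current code, based on the answer given by @Code Different below: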
tmp = df.set_index(["Category", "Depts"])
tmp.columns = pd.MultiIndex.from_tuples([tuple(col.split(" ")) for col in tmp.columns], name=[None, "Metric"])
tmp = tmp.stack(level=1)
monthly = tmp.pivot_table(index=["Category", "Metric"], columns="Depts", values="Monthly", aggfunc="sum")
monthly = pd.concat([d.append(d.sum().rename(('Total', k))) for k, d in monthly.groupby(level=1)])
monthly = monthly.groupby(level=[0, 1], as_index=True).sum()
monthly.loc[:,'Total'] = monthly.sum(axis=1)
This keeps the MultiIndex, but if I call reset_index, the grouping is lost when I then use the to_html or to_excel functions. I want to avoid that.
Try this:
tmp = df.set_index(["Category", "Depts"])
tmp.columns = pd.MultiIndex.from_tuples([tuple(col.split(" ")) for col in tmp.columns], name=[None, "Metric"])
tmp = tmp.stack(level=1)
monthly = tmp.pivot_table(index=["Category", "Metric"], columns="Depts", values="Monthly", aggfunc="sum")
yearly = tmp.pivot_table(index=["Category", "Metric"], columns="Depts", values="Yearly", aggfunc="sum")
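The same reshaping can also be sketched with melt followed by pivot_table, which avoids stacking a column MultiIndex. This is an alternative illustration, not the answerer's method: the helper name pivot_period and the intermediate column names (name, value, Period, Metric) are assumptions made for this example, and the sample data is copied from the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Depts": ["HR", "Software", "Security", "Software", "Software", "Security"],
    "Category": ["Human", "Engg", "Human", "Engg", "Human", "Engg"],
    "Monthly Booked": [2345, 654345, 1234, 12345, 12345, 12345],
    "Monthly Delivered": [2000, 343213, 1234, 54334, 54334, 54334],
    "Monthly Target": [3000, 765432, 2000, 324546, 324546, 324546],
    "Yearly Booked": [1234556, 98765123, 23456, 345645345, 345645345, 34564534],
    "Yearly Delivered": [234543, 2345654, 34568, 65345654, 65345654, 65345654],
    "Yearly Target": [6432212, 9999999, 234567, 643563452, 643563452, 643563452],
})


def pivot_period(frame, period):
    """Hypothetical helper: one row per (Category, Metric), one column per Depts."""
    # Long format: one row per (Depts, Category, original column name).
    long = frame.melt(id_vars=["Depts", "Category"], var_name="name", value_name="value")
    # Split "Monthly Booked" into Period="Monthly" and Metric="Booked".
    parts = long["name"].str.split(" ", n=1, expand=True)
    long["Period"] = parts[0]
    long["Metric"] = parts[1]
    subset = long[long["Period"] == period]
    # Sum duplicates such as the two Software/Engg rows.
    return subset.pivot_table(
        index=["Category", "Metric"], columns="Depts", values="value", aggfunc="sum"
    )


monthly = pivot_period(df, "Monthly")
yearly = pivot_period(df, "Yearly")
print(monthly)
print(yearly)
```

Combinations that never occur in the data, such as HR under Engg, come out as NaN, which is why the summed columns are floats.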
A piece inspired by the other answer:
import pandas as pd

def pivot_and_stuff(dataframe, values):
    # Split names like "Monthly Booked" into a two-level column index (period, Metric)
    tmp = dataframe.set_index(["Category", "Depts"])
    tmp.columns = pd.MultiIndex.from_tuples([tuple(col.split(" ")) for col in tmp.columns], name=[None, "Metric"])
    # Move the Metric level from the columns into the rows
    tmp = tmp.stack(level=1)
    # One column per department, summed within each (Category, Metric) pair
    tmp = tmp.pivot_table(index=["Category", "Metric"], columns="Depts", values=values, aggfunc="sum")
    # Per-metric totals across categories, labelled with Category "Total"
    tmp2 = tmp.groupby(level=[1]).sum()
    tmp2['Category'] = 'Total'
    tmp2 = tmp2.set_index('Category', append=True).reorder_levels([1,0])
    dataframe = pd.concat([tmp, tmp2]).rename_axis('', axis=1).rename_axis(['Category', 'Metric'])
    # Turn the index into ordinary columns and blank out repeated Category labels
    dataframe = dataframe.reset_index().rename_axis('', axis=1)
    dataframe.Category = [i if not j else '' for i, j in zip(dataframe.Category.values, dataframe.Category.duplicated())]
    return dataframe

# Display floats without decimals
pd.set_option('display.float_format', '{:.0f}'.format)
df_m = pivot_and_stuff(df, 'Monthly')
df_y = pivot_and_stuff(df, 'Yearly')
print(df_m)
print()
print(df_y)
Output:
   Category     Metric    HR  Security  Software
0      Engg     Booked   NaN     12345    666690
1            Delivered   NaN     54334    397547
2               Target   NaN    324546   1089978
3     Human     Booked  2345      1234     12345
4            Delivered  2000      1234     54334
5               Target  3000      2000    324546
6     Total     Booked  2345     13579    679035
7            Delivered  2000     55568    451881
8               Target  3000    326546   1414524

   Category     Metric       HR   Security    Software
0      Engg     Booked      NaN   34564534   444410468
1            Delivered      NaN   65345654    67691308
2               Target      NaN  643563452   653563451
3     Human     Booked  1234556      23456   345645345
4            Delivered   234543      34568    65345654
5               Target  6432212     234567   643563452
6     Total     Booked  1234556   34587990   790055813
7            Delivered   234543   65380222   133036962
8               Target  6432212  643798019  1297126903
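On the question's concern about to_html and to_excel, here is a rough sketch of two ways the grouping might be kept on export. The small table below is a hand-built stand-in for part of the monthly result, and the file name in the comment is hypothetical. The first option leaves the MultiIndex in place and lets pandas render the outer level as a spanned cell; the second blanks the repeated labels, as in the answer above, and exports without the index.

```python
import pandas as pd

# Hand-built stand-in for part of the monthly table (values from the output above).
index = pd.MultiIndex.from_product(
    [["Engg", "Human"], ["Booked", "Delivered", "Target"]],
    names=["Category", "Metric"],
)
nan = float("nan")
table = pd.DataFrame(
    {
        "HR": [nan, nan, nan, 2345.0, 2000.0, 3000.0],
        "Security": [12345.0, 54334.0, 324546.0, 1234.0, 1234.0, 2000.0],
        "Software": [666690.0, 397547.0, 1089978.0, 12345.0, 54334.0, 324546.0],
    },
    index=index,
)

# Option 1: keep the MultiIndex. With the default sparsify behaviour the
# outer level is written once per group as a cell spanning several rows.
grouped_html = table.to_html(na_rep="", float_format="{:.0f}".format)

# to_excel merges MultiIndex cells by default (merge_cells=True); it needs an
# Excel writer engine such as openpyxl, so it is left commented out here.
# table.to_excel("monthly.xlsx")  # hypothetical file name

# Option 2: turn the index into columns and blank repeated Category labels,
# then export without the integer index.
flat = table.reset_index()
flat["Category"] = flat["Category"].mask(flat["Category"].duplicated(), "")
flat_html = flat.to_html(index=False, na_rep="", float_format="{:.0f}".format)

print(grouped_html)
print(flat_html)
```

Option 1 keeps Category and Metric as index levels, so they are rendered as header cells rather than data cells. Option 2 matches the question's wish for plain columns, at the cost of storing empty strings in place of the repeated labels.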