Pandas pivot table and subtotals
Using this data:
import pandas as pd

d2 = {'Division': ['DIV1', 'DIV2', 'DIV1', 'DIV3', 'DIV2'],
      'Region': ['DIV1-South', 'DIV2-North', 'DIV1-North', 'DIV3-East', 'DIV2-South'],
      'MD': ['Susie', 'Martha', 'Jane', 'Nichole', 'Randall'],
      'Month': ['JAN', 'JAN', 'FEB', 'MAR', 'APR']}
df2 = pd.DataFrame(d2)
which looks like this:
  Division      Region       MD Month
0     DIV1  DIV1-South    Susie   JAN
1     DIV2  DIV2-North   Martha   JAN
2     DIV1  DIV1-North     Jane   FEB
3     DIV3   DIV3-East  Nichole   MAR
4     DIV2  DIV2-South  Randall   APR
Thanks to the community here, I was able to pivot this data to get counts for each month, using this line of code:
pivoted = df2.pivot_table(index=['Division', 'Region', 'MD'], columns='Month', aggfunc=len, fill_value=0)
Month                         APR  FEB  JAN  MAR
Division Region     MD
DIV1     DIV1-North Jane        0    1    0    0
         DIV1-South Susie       0    0    1    0
DIV2     DIV2-North Martha      0    0    1    0
         DIV2-South Randall     1    0    0    0
DIV3     DIV3-East  Nichole     0    0    0    1
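This may not be possible, but I have found only one reference online for producing a pivot result that includes subtotals for each section, and unfortunately that example did not work.
The ideal result would be: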
Month                                     APR  FEB  JAN  MAR
Division Region     MD
DIV1     DIV1-North Jane                    0    1    0    0
                    DIV1-North SubTotal     0    1    0    0
         DIV1-South Susie                   0    0    1    0
                    DIV1-South SubTotal     0    0    1    0
         DIV1 TOTAL                         0    1    1    0
DIV2     DIV2-North Martha                  0    0    1    0
                    DIV2-North SubTotal     0    0    1    0
         DIV2-South Randall                 1    0    0    0
                    DIV2-South SubTotal     1    0    0    0
         DIV2 TOTAL                         1    0    1    0
DIV3     DIV3-East  Nichole                 0    0    0    1
                    DIV3-East SubTotal      0    0    0    1
         DIV3 TOTAL                         0    0    0    1
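This is a bit of a brain-teaser and may not even be possible, but since it is fairly easy in an Excel pivot table, I was hoping pandas had this feature somewhere and I just couldn't find it. (That is still the case after several days of searching and testing.)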
df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
                         "bar", "bar", "bar", "bar"],
                   "B": ["one", "one", "one", "two", "two",
                         "one", "one", "two", "two"],
                   "C": ["small", "large", "large", "small",
                         "small", "large", "small", "small",
                         "large"],
                   "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
                   "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
Output
     A    B      C  D  E
0  foo  one  small  1  2
1  foo  one  large  2  4
2  foo  one  large  2  5
3  foo  two  small  3  5
4  foo  two  small  3  6
5  bar  one  large  4  6
6  bar  one  small  5  8
7  bar  two  small  6  9
8  bar  two  large  7  9
# aggfunc='sum' gives the same result as np.sum without needing a NumPy import
table = pd.pivot_table(df, values='D', index=['A', 'B'],
                       columns=['C'], aggfunc='sum')
Output of the pivot table
table
C        large  small
A   B
bar one    4.0    5.0
    two    7.0    6.0
foo one    4.0    1.0
    two    NaN    6.0
You can build the subtotal rows with .groupby() and GroupBy.sum(), then concatenate them with the pivoted data, as follows:
pivoted2 = pivoted.reset_index()

# Create `Division` Total
df_Div_sum = pivoted2.groupby('Division', as_index=False).sum()
df_Div_sum['Region'] = '_' + df_Div_sum['Division'] + ' Total'
df_Div_sum['MD'] = ''

# Create `Region` SubTotal
df_Reg_sum = pivoted2.groupby(['Division', 'Region'], as_index=False).sum()
df_Reg_sum['MD'] = '_' + df_Reg_sum['Region'] + ' SubTotal'

# Concat results and set index + sort index
df_out = (pd.concat([pivoted2,
                     df_Reg_sum,
                     df_Div_sum
                     ])
            .set_index(['Division', 'Region', 'MD'])
            .sort_index()
          )
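The leading `_` in the subtotal labels is what makes `.sort_index()` put each subtotal after its detail rows. As a small illustration (the label list below is made up from the sample data, not part of the original answer), `_` sorts after every uppercase ASCII letter but before lowercase ones, so the trick assumes the names start with an uppercase letter:

```python
import pandas as pd

# Labels taken from the sample data, mixed with the generated subtotal labels.
labels = ["Susie", "_DIV1-South SubTotal", "Jane", "DIV1-South", "_DIV1 Total"]

# Plain lexicographic sort, the same ordering sort_index() applies to string labels.
ordered = pd.Index(labels).sort_values().tolist()
print(ordered)

# Caveat: a lowercase name would sort AFTER the "_" labels.
lower_case = pd.Index(["jane", "_DIV1-North SubTotal"]).sort_values().tolist()
print(lower_case)
```

If your names may start with lowercase letters, a different prefix or an explicit sort key would be needed.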
Input setup
d2 = {'Division': ['DIV1', 'DIV2', 'DIV1', 'DIV3', 'DIV2'],
      'Region': ['DIV1-South', 'DIV2-North', 'DIV1-North', 'DIV3-East', 'DIV2-South'],
      'MD': ['Susie', 'Martha', 'Jane', 'Nichole', 'Randall'],
      'Month': ['JAN', 'JAN', 'FEB', 'MAR', 'APR']}
df = pd.DataFrame(d2)
pivoted = df.pivot_table(index=['Division', 'Region', 'MD'], columns='Month', aggfunc=len, fill_value=0)
Output
print(df_out)
Month                                       APR  FEB  JAN  MAR
Division Region      MD
DIV1     DIV1-North  Jane                     0    1    0    0
                     _DIV1-North SubTotal     0    1    0    0
         DIV1-South  Susie                    0    0    1    0
                     _DIV1-South SubTotal     0    0    1    0
         _DIV1 Total                          0    1    1    0
DIV2     DIV2-North  Martha                   0    0    1    0
                     _DIV2-North SubTotal     0    0    1    0
         DIV2-South  Randall                  1    0    0    0
                     _DIV2-South SubTotal     1    0    0    0
         _DIV2 Total                          1    0    1    0
DIV3     DIV3-East   Nichole                  0    0    0    1
                     _DIV3-East SubTotal      0    0    0    1
         _DIV3 Total                          0    0    0    1
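For convenience, the steps above can be collected into one self-contained script. This is only a sketch: the function name `pivot_with_subtotals` is made up here, and it builds the counts with `groupby(...).size().unstack()` instead of `pivot_table(aggfunc=len)`, on the assumption that both produce the same count table for this data.

```python
import pandas as pd

def pivot_with_subtotals(df):
    """Count rows per Month and add Region subtotals and Division totals."""
    # Count rows per (Division, Region, MD, Month) and spread Month into columns.
    counts = (df.groupby(['Division', 'Region', 'MD', 'Month'])
                .size()
                .unstack('Month', fill_value=0))
    months = list(counts.columns)
    flat = counts.reset_index()

    # Region subtotals: one row per (Division, Region).
    reg = flat.groupby(['Division', 'Region'], as_index=False)[months].sum()
    reg['MD'] = '_' + reg['Region'] + ' SubTotal'

    # Division totals: one row per Division.
    div = flat.groupby('Division', as_index=False)[months].sum()
    div['Region'] = '_' + div['Division'] + ' Total'
    div['MD'] = ''

    # The "_" prefix makes the subtotal rows sort after the detail rows.
    return (pd.concat([flat, reg, div])
              .set_index(['Division', 'Region', 'MD'])
              .sort_index())

d2 = {'Division': ['DIV1', 'DIV2', 'DIV1', 'DIV3', 'DIV2'],
      'Region': ['DIV1-South', 'DIV2-North', 'DIV1-North', 'DIV3-East', 'DIV2-South'],
      'MD': ['Susie', 'Martha', 'Jane', 'Nichole', 'Randall'],
      'Month': ['JAN', 'JAN', 'FEB', 'MAR', 'APR']}
out = pivot_with_subtotals(pd.DataFrame(d2))
print(out)
```

Selecting only the month columns before `.sum()` keeps the text columns out of the aggregation, which avoids depending on how a given pandas version sums strings.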
Extended test data
Since your sample data has only one entry per Region, I added more test data for a more complete test:
Input setup
data = {'Division': ['DIV1', 'DIV1', 'DIV2', 'DIV2', 'DIV1', 'DIV1', 'DIV3', 'DIV3', 'DIV2', 'DIV2', 'DIV2'],
        'Region': ['DIV1-South', 'DIV1-South', 'DIV2-North', 'DIV2-North', 'DIV1-North', 'DIV1-North', 'DIV3-East', 'DIV3-East', 'DIV2-South', 'DIV2-South', 'DIV2-South'],
        'MD': ['Susie', 'Susie2', 'Martha', 'Martha2', 'Jane', 'Jane2', 'Nichole', 'Nichole2', 'Randall2', 'Randall3', 'Randall'],
        'Month': ['JAN', 'FEB', 'JAN', 'MAR', 'FEB', 'APR', 'MAR', 'APR', 'FEB', 'MAR', 'APR']}
df = pd.DataFrame(data)
pivoted = df.pivot_table(index=['Division', 'Region', 'MD'], columns='Month', aggfunc=len, fill_value=0)
print(pivoted)
Month                          APR  FEB  JAN  MAR
Division Region     MD
DIV1     DIV1-North Jane         0    1    0    0
                    Jane2        1    0    0    0
         DIV1-South Susie        0    0    1    0
                    Susie2       0    1    0    0
DIV2     DIV2-North Martha       0    0    1    0
                    Martha2      0    0    0    1
         DIV2-South Randall      1    0    0    0
                    Randall2     0    1    0    0
                    Randall3     0    0    0    1
DIV3     DIV3-East  Nichole      0    0    0    1
                    Nichole2     1    0    0    0
Output
print(df_out)
Month                                       APR  FEB  JAN  MAR
Division Region      MD
DIV1     DIV1-North  Jane                     0    1    0    0
                     Jane2                    1    0    0    0
                     _DIV1-North SubTotal     1    1    0    0
         DIV1-South  Susie                    0    0    1    0
                     Susie2                   0    1    0    0
                     _DIV1-South SubTotal     0    1    1    0
         _DIV1 Total                          1    2    1    0
DIV2     DIV2-North  Martha                   0    0    1    0
                     Martha2                  0    0    0    1
                     _DIV2-North SubTotal     0    0    1    1
         DIV2-South  Randall                  1    0    0    0
                     Randall2                 0    1    0    0
                     Randall3                 0    0    0    1
                     _DIV2-South SubTotal     1    1    0    1
         _DIV2 Total                          1    1    1    2
DIV3     DIV3-East   Nichole                  0    0    0    1
                     Nichole2                 1    0    0    0
                     _DIV3-East SubTotal      1    0    0    1
         _DIV3 Total                          1    0    0    1
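The question's ideal result has no `_` in front of the subtotal labels. One possible follow-up, sketched below on a tiny hand-built stand-in for `df_out` (the values are illustrative only), is to strip the prefix once the rows are already in the right order. `DataFrame.rename` with a function applies it to the labels of every index level.

```python
import pandas as pd

# Small stand-in for the already sorted df_out (illustrative values only).
idx = pd.MultiIndex.from_tuples(
    [('DIV1', 'DIV1-North', 'Jane'),
     ('DIV1', 'DIV1-North', '_DIV1-North SubTotal'),
     ('DIV1', '_DIV1 Total', '')],
    names=['Division', 'Region', 'MD'])
df_out = pd.DataFrame({'APR': [0, 0, 0], 'FEB': [1, 1, 1]}, index=idx)

def strip_prefix(label):
    # Drop a single leading underscore; other labels pass through unchanged.
    return label[1:] if label.startswith('_') else label

# Row order is kept; only the labels change.
clean = df_out.rename(index=strip_prefix)
print(clean)
```

Do this as the last step: sorting the index again after the rename would move the subtotal rows back in among the detail rows.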