How to get multiple levels of aggregated sums into time series columns from a dataframe
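I have a pandas dataframe containing monthly counts at several levels. It is in long format, and I want to convert it to wide format, with columns for each level of aggregation.
The format is as follows: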
date | country | state | county | population
01-01| cc1 | s1 | c1 | 5
01-01| cc1 | s1 | c2 | 4
01-01| cc1 | s2 | c1 | 10
01-01| cc1 | s2 | c2 | 11
02-01| cc1 | s1 | c1 | 6
02-01| cc1 | s1 | c2 | 5
02-01| cc1 | s2 | c1 | 11
02-01| cc1 | s2 | c2 | 12
.
.
Now I want to convert it into the following format:
date | country_pop| s1_pop | s2_pop| .. | s1_c1_pop | s1_c2_pop| s2_c1_pop | s2_c2_pop|..
01-01| 30 | 9 | 21 | ...| 5 | 4 | 10 | 11 |..
02-01| 34 | 11 | 23 | ...| 6 | 5 | 11 | 12 |..
.
.
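There are 4 states in total, s1 ... s4.
The counties of each state can be labelled c1 ... c10 (some states may have fewer, and I want those columns to be zero).
I want a time series for each level of aggregation, sorted by date. How do I get this?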
Let's do it this way: sum over index levels and combine all the dataframes with pd.concat. The sums use groupby(level=...).sum(), because DataFrame.sum(level=...) was removed in pandas 2.0.
import pandas as pd

# Aggregate to the lowest level of detail
df_agg = df.groupby(['country', 'date', 'state', 'county'])[['population']].sum()
# Reshape the dataframe and flatten the MultiIndex column header
df_county = df_agg.unstack([-1, -2])
df_county.columns = [f'{s}_{c}_{p}' for p, c, s in df_county.columns]
# Sum to the next level of detail (country, date, state) and reshape
df_state = df_agg.groupby(level=[0, 1, 2]).sum().unstack()
df_state.columns = [f'{s}_{p}' for p, s in df_state.columns]
# Sum to the country level (country, date)
df_country = df_agg.groupby(level=[0, 1]).sum()
# Concatenate horizontally with axis=1
df_out = pd.concat([df_country, df_state, df_county], axis=1).reset_index()
Output:
country date population s1_population s2_population s1_c1_population \
0 cc1 01-01 30 9 21 5
1 cc1 02-01 34 11 23 6
s1_c2_population s2_c1_population s2_c2_population
0 4 10 11
1 5 11 12
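The question also asks for zero-filled columns for counties that a state does not have. One possible way to get that, sketched here under assumptions, is to reindex each level against the full list of expected labels. The label lists `states` (s1 to s4) and `counties` (c1 to c10) are built from the ranges given in the question, the `_pop` suffix follows the asker's desired header, and the eight sample rows are copied from the question's table; the variable names are illustrative only.

```python
import pandas as pd

# Sample rows copied from the question's table.
df = pd.DataFrame({
    "date": ["01-01"] * 4 + ["02-01"] * 4,
    "country": ["cc1"] * 8,
    "state": ["s1", "s1", "s2", "s2"] * 2,
    "county": ["c1", "c2", "c1", "c2"] * 2,
    "population": [5, 4, 10, 11, 6, 5, 11, 12],
})

# Assumed full label sets, taken from the ranges stated in the question.
states = [f"s{i}" for i in range(1, 5)]
counties = [f"c{i}" for i in range(1, 11)]
keys = ["country", "date"]

# County level: one column per (state, county) pair, absent pairs filled with 0.
county = (
    df.groupby(keys + ["state", "county"])["population"].sum()
      .unstack(["state", "county"], fill_value=0)
)
full_grid = pd.MultiIndex.from_product([states, counties], names=["state", "county"])
county = county.reindex(columns=full_grid, fill_value=0)
county.columns = [f"{s}_{c}_pop" for s, c in county.columns]

# State level: one column per state, absent states filled with 0.
state = (
    df.groupby(keys + ["state"])["population"].sum()
      .unstack("state", fill_value=0)
      .reindex(columns=states, fill_value=0)
)
state.columns = [f"{s}_pop" for s in state.columns]

# Country level: a single total per (country, date).
country = df.groupby(keys)["population"].sum().rename("country_pop")

# Combine side by side and order the rows by date.
wide = (
    pd.concat([country, state, county], axis=1)
      .reset_index()
      .sort_values("date", ignore_index=True)
)
```

With the sample rows this gives one row per date and 47 columns (country, date, one country total, 4 state totals, 40 state-county columns); s3, s4, and counties c3 to c10 contain only zeros. Sorting on `date` here is a plain string sort, so converting the column with `pd.to_datetime` first is probably safer once the data includes a year.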