pandas groupby to calculate percentage of groupby columns
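I want to calculate rate_death as a percentage, as follows:
(new_deaths / population) * 100, grouped by location, with new_deaths summed.
Example: for Afghanistan, rate_death must be calculated as ((1+4+10) / 38928341) * 100
For Albania, it must be calculated as ((0+0+1) / 2877800) * 100
Below are the data and the approaches I tried, which did not work: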
df_data
      location       date  new_cases  new_deaths  population
0  Afghanistan  4/25/2020         70           1    38928341
1  Afghanistan  4/26/2020        112           4    38928341
2  Afghanistan  4/27/2020         68          10    38928341
3      Albania  4/25/2020         15           0     2877800
4      Albania  4/26/2020         34           0     2877800
5      Albania  4/27/2020         14           1     2877800
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   location    6 non-null      object
 1   date        6 non-null      object
 2   new_cases   6 non-null      int64
 3   new_deaths  6 non-null      int64
 4   population  6 non-null      int64
Approach 1:
df_res = df_data[['location','new_deaths','population']].groupby(['location']).sum()
location     new_deaths  population
Afghanistan          15   116785023
Albania               1     8633400
df_res['rate_death'] = (df_res['new_deaths'] / df_res['population'] * 100.0)
location     new_deaths  population  rate_death
Afghanistan          15   116785023       0.000
Albania               1     8633400       0.000
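I understand that, because of the groupby and 'sum' operation above, the population is added up once per row (three times here), but I still wonder why rate_death is not calculated as a percentage as expected and instead shows as 0.000.
Approach 2: (tried as described in this post: Pandas percentage of total with groupby)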
location_population = df_data.groupby(['location', 'population']).agg({'new_deaths': 'sum'})
location = df_data.groupby(['location']).agg({'population': 'mean'})
location_population.div(location, level='location') * 100
location     population  new_deaths  population
Afghanistan  38928341           NaN         NaN
Albania      2877800            NaN         NaN
But it comes out as NaN.
If there is anything wrong with these approaches, or you know how to fix this, please help. Thanks!
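On the NaN result in Approach 2: DataFrame.div aligns on column labels as well as on the index. Here the left frame has a new_deaths column and the right frame has a population column, so no cells line up and every result is NaN. Below is a minimal sketch of one way around that, assuming a reduced copy of df_data; the names aligned and rate are illustrative and not from the post. It renames the divisor's column so the labels match before dividing.

```python
import pandas as pd

# Reduced copy of the question's data (only the columns Approach 2 uses).
df_data = pd.DataFrame({
    "location": ["Afghanistan"] * 3 + ["Albania"] * 3,
    "new_deaths": [1, 4, 10, 0, 0, 1],
    "population": [38928341] * 3 + [2877800] * 3,
})

location_population = df_data.groupby(["location", "population"]).agg({"new_deaths": "sum"})
location = df_data.groupby(["location"]).agg({"population": "mean"})

# As in the question: the frames share no column label, so all cells are NaN.
mismatched = location_population.div(location, level="location") * 100

# Give the divisor the same column label as the dividend, then divide.
aligned = location.rename(columns={"population": "new_deaths"})
rate = location_population.div(aligned, level="location") * 100
rate = rate.rename(columns={"new_deaths": "rate_death"})
print(rate)
```

The linked post works because both of its frames carry the same column name; that is the only piece missing here.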
You can do this:
df = df_data.groupby(['location']).agg({'new_deaths': 'sum', 'population': 'max'})
df['rate_death'] = df['new_deaths'] / df['population'] * 100
Result
             new_deaths  population  rate_death
location
Afghanistan          15    38928341    0.000039
Albania               1     2877800    0.000035
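As for why Approach 1 showed 0.000: the computed value is tiny but not zero. For Afghanistan it is 15 / 116785023 * 100, roughly 1.28e-05, which rounds to 0.000 at three decimal places. That a three-decimal float format was in effect is an assumption on my part; the post does not say how the display was configured. Below is a self-contained sketch that rebuilds the sample data, checks that value, and then applies the aggregation above using named aggregation (a stylistic choice of mine, not the answer's exact call).

```python
import pandas as pd

# Sample data copied from the question.
df_data = pd.DataFrame({
    "location": ["Afghanistan"] * 3 + ["Albania"] * 3,
    "date": ["4/25/2020", "4/26/2020", "4/27/2020"] * 2,
    "new_cases": [70, 112, 68, 15, 34, 14],
    "new_deaths": [1, 4, 10, 0, 0, 1],
    "population": [38928341] * 3 + [2877800] * 3,
})

# Approach 1 from the question: summing population inflates the denominator.
approach_1 = df_data[["location", "new_deaths", "population"]].groupby("location").sum()
approach_1["rate_death"] = approach_1["new_deaths"] / approach_1["population"] * 100
afghanistan_rate = approach_1.loc["Afghanistan", "rate_death"]

# The value is nonzero; a three-decimal format simply hides it.
as_three_decimals = f"{afghanistan_rate:.3f}"
as_scientific = f"{afghanistan_rate:.3e}"
print(as_three_decimals, as_scientific)

# Sum the deaths, but take the population only once per location.
df_res = df_data.groupby("location").agg(
    new_deaths=("new_deaths", "sum"),
    population=("population", "max"),
)
df_res["rate_death"] = df_res["new_deaths"] / df_res["population"] * 100
print(df_res)
```

Taking 'max' (or 'first') of population relies on the population being constant within each location, which holds for this sample.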