Pandas per group imputation of missing values
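How can such a per-country imputation be achieved for each indicator in pandas?
I want to impute the missing values per group:
- no-A-state should get np.min per indicatorKPI
- no-ISO-state should get np.mean per indicatorKPI
- for states with missing values, I want to impute with the per-indicatorKPI mean. Here, this would mean imputing the missing values for Serbia.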
import numpy as np
import pandas as pd

mydf = pd.DataFrame({'Country': ['no-A-state', 'no-ISO-state', 'germany', 'serbia', 'austria', 'germany', 'serbia', 'austria'],
                     'indicatorKPI': [np.nan, np.nan, 'SP.DYN.LE00.IN', 'NY.GDP.MKTP.CD', 'NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN', 'NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN'],
                     'value': [np.nan, np.nan, 0.9, np.nan, 0.7, 0.2, 0.3, 0.6]})
Edit
The desired output should look similar to this:
mydf = pd.DataFrame({'Country':['no-A-state','no-ISO-state', 'no-A-state','no-ISO-state',
'germany','serbia','serbia', 'austria',
'germany','serbia', 'austria',],
'indicatorKPI':['SP.DYN.LE00.IN','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN',
'SP.DYN.LE00.IN','NY.GDP.MKTP.CD','SP.DYN.LE00.IN','NY.GDP.MKTP.CD','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN'],
'value':['MIN of all for this indicator', 'MEAN of all for this indicator','MIN of all for this indicator','MEAN of all for this indicator', 0.9,'MEAN of all for SP.DYN.LE00.IN indicator',0.7, 'MEAN of all for NY.GDP.MKTP.CD indicator',0.2, 0.3, 0.6]
})
Based on your new example df, the following works for me:
In [185]:
mydf.loc[mydf['Country'] == 'no-A-state', 'value'] = mydf['value'].min()
mydf.loc[mydf['Country'] == 'no-ISO-state', 'value'] = mydf['value'].mean()
mydf.loc[mydf['value'].isnull(), 'value'] = mydf['indicatorKPI'].map(mydf.groupby('indicatorKPI')['value'].mean())
mydf
Out[185]:
         Country    indicatorKPI     value
0     no-A-state  SP.DYN.LE00.IN  0.200000
1   no-ISO-state  NY.GDP.MKTP.CD  0.442857
2     no-A-state  SP.DYN.LE00.IN  0.200000
3   no-ISO-state  SP.DYN.LE00.IN  0.442857
4        germany  NY.GDP.MKTP.CD  0.900000
5         serbia  SP.DYN.LE00.IN  0.328571
6         serbia  NY.GDP.MKTP.CD  0.700000
7        austria  NY.GDP.MKTP.CD  0.585714
8        germany  SP.DYN.LE00.IN  0.200000
9         serbia  NY.GDP.MKTP.CD  0.300000
10       austria  SP.DYN.LE00.IN  0.600000
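Basically this fills the missing values for each condition in turn: first we set the min for the 'no-A-state' rows, then the mean for the 'no-ISO-state' rows. After that we group on 'indicatorKPI', calculate the mean of each group, and assign it to the rows that are still null. The matching group mean for each row is found with map, which performs a lookup on the indicator.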
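The input frame is not shown in the answer, so here is a self-contained sketch of the same three statements. It assumes the input is the desired-output frame with NaN in place of each placeholder string; the name `group_means` is introduced only for readability.

```python
import numpy as np
import pandas as pd

# Assumed input: the rows of the desired-output frame, with NaN wherever
# that frame holds a placeholder string such as 'MIN of all ...'.
mydf = pd.DataFrame({
    'Country': ['no-A-state', 'no-ISO-state', 'no-A-state', 'no-ISO-state',
                'germany', 'serbia', 'serbia', 'austria',
                'germany', 'serbia', 'austria'],
    'indicatorKPI': ['SP.DYN.LE00.IN', 'NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN',
                     'SP.DYN.LE00.IN', 'NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN',
                     'NY.GDP.MKTP.CD', 'NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN',
                     'NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN'],
    'value': [np.nan, np.nan, np.nan, np.nan, 0.9, np.nan,
              0.7, np.nan, 0.2, 0.3, 0.6],
})

# Step 1: 'no-A-state' rows get the minimum of all observed values.
mydf.loc[mydf['Country'] == 'no-A-state', 'value'] = mydf['value'].min()

# Step 2: 'no-ISO-state' rows get the mean of all values present after step 1.
mydf.loc[mydf['Country'] == 'no-ISO-state', 'value'] = mydf['value'].mean()

# Step 3: remaining NaN rows get the mean of their indicator group.
# map() looks up each row's indicator in the per-group mean Series.
group_means = mydf.groupby('indicatorKPI')['value'].mean()
mydf.loc[mydf['value'].isnull(), 'value'] = mydf['indicatorKPI'].map(group_means)

print(mydf)
```

Note that the order matters: the mean in step 2 already includes the values filled in step 1, and the group means in step 3 include both sets of filled rows. Also, the min and mean in steps 1 and 2 are taken over the whole 'value' column, not per indicator.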
Here are the steps broken down:
In [187]:
mydf.groupby('indicatorKPI')['value'].mean()
Out[187]:
indicatorKPI
NY.GDP.MKTP.CD 0.633333
SP.DYN.LE00.IN 0.400000
Name: value, dtype: float64
In [188]:
mydf['indicatorKPI'].map(mydf.groupby('indicatorKPI')['value'].mean())
Out[188]:
0 0.400000
1 0.633333
2 0.400000
3 0.400000
4 0.633333
5 0.400000
6 0.633333
7 0.633333
8 0.400000
9 0.633333
10 0.400000
Name: indicatorKPI, dtype: float64
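The question asks for the min and mean per indicatorKPI, whereas the statements above use the min and mean of the whole 'value' column for the two placeholder countries. If the per-indicator reading is what is wanted, one possible variant is sketched below. It assumes the same NaN-filled input frame as the desired output, uses groupby(...).transform in place of map, and computes the statistics from the observed values only; the names `kpi_min`, `kpi_mean`, `is_no_a` and `fill` are illustrative.

```python
import numpy as np
import pandas as pd

# Assumed input: desired-output rows with NaN where a value must be imputed.
mydf = pd.DataFrame({
    'Country': ['no-A-state', 'no-ISO-state', 'no-A-state', 'no-ISO-state',
                'germany', 'serbia', 'serbia', 'austria',
                'germany', 'serbia', 'austria'],
    'indicatorKPI': ['SP.DYN.LE00.IN', 'NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN',
                     'SP.DYN.LE00.IN', 'NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN',
                     'NY.GDP.MKTP.CD', 'NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN',
                     'NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN'],
    'value': [np.nan, np.nan, np.nan, np.nan, 0.9, np.nan,
              0.7, np.nan, 0.2, 0.3, 0.6],
})

# transform() broadcasts each group's statistic back onto every row of the
# group (NaN values are skipped when computing it).
grouped = mydf.groupby('indicatorKPI')['value']
kpi_min = grouped.transform('min')
kpi_mean = grouped.transform('mean')

# 'no-A-state' rows take the indicator minimum; every other row takes
# the indicator mean.
is_no_a = mydf['Country'] == 'no-A-state'
fill = kpi_mean.where(~is_no_a, kpi_min)

# fillna() only touches rows that are still NaN, so observed values stay as-is.
mydf['value'] = mydf['value'].fillna(fill)

print(mydf)
```

Because the statistics are computed once, before any filling, the result does not depend on the order in which the placeholder rows are handled, unlike the sequential version above.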