How can I impute missing values using a yearly growth rate in Python?
I have a dataset in the following format:
Country Code Year Value
0 ABC 32 2000 NaN
1 ABC 32 2001 NaN
2 ABC 32 2002 NaN
3 ABC 32 2003 NaN
4 ABC 32 2004 1000000.0
5 ABC 32 2005 NaN
6 ABC 32 2006 NaN
7 ABC 32 2007 NaN
8 ABC 32 2008 NaN
9 ABC 32 2009 NaN
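I am trying to replace the NaN values so that they show r% yearly growth around the non-NaN value; in other words, for the sample data, Value[i] should equal 1000000 * (1+r)^x, where x is the difference between index i and the index of the non-NaN value (negative for rows before it).
For this small set, the following code does the job: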
df['imputed'] = ''
gr = 0.05  # growth rate
for i in range(len(df)):
    nx = df.Value.first_valid_index()  # index of first non-NaN value
    nv = df.Value[nx]                  # first non-NaN value
    # .loc avoids chained assignment (df['imputed'][i] = ...)
    df.loc[i, 'imputed'] = nv * (1+gr) ** (i - nx)
df
Country Code Year Value imputed
0 ABC 32 2000 NaN 822702
1 ABC 32 2001 NaN 863838
2 ABC 32 2002 NaN 907029
3 ABC 32 2003 NaN 952381
4 ABC 32 2004 1000000.0 1e+06
5 ABC 32 2005 NaN 1.05e+06
6 ABC 32 2006 NaN 1.1025e+06
7 ABC 32 2007 NaN 1.15763e+06
8 ABC 32 2008 NaN 1.21551e+06
9 ABC 32 2009 NaN 1.27628e+06
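The same single-group calculation can be written without a row-by-row loop. This is a minimal sketch, assuming the frame has the default 0..n-1 index and exactly one non-NaN Value, as in the sample; the sample frame is rebuilt here only so the snippet runs on its own, and the name offsets is introduced for illustration.

```python
import numpy as np
import pandas as pd

gr = 0.05  # growth rate, same value as above

# Rebuild the sample frame from the question (one group, one known value).
df = pd.DataFrame({
    'Country': 'ABC',
    'Code': 32,
    'Year': np.arange(2000, 2010),
    'Value': np.nan,
})
df.loc[4, 'Value'] = 1000000.0

nx = df['Value'].first_valid_index()  # index label of the non-NaN value
nv = df.loc[nx, 'Value']              # the non-NaN value itself

# Assumption: default 0..n-1 index, so row positions equal index labels.
offsets = np.arange(len(df)) - nx     # x in 1000000 * (1+r)^x
df['imputed'] = nv * (1 + gr) ** offsets
```

Because imputed is a float column here, pandas prints it in a uniform numeric format rather than the mixed formatting shown above.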
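However, the real dataset has multiple combinations of 'Country' and 'Code' that need a similar calculation (note: each of these combinations has only one non-NaN value, just like above).
If I create a new df (df2) with all the required Country/Code combinations, how can I apply the above calculation to every matching combination in the main df? Note that there are also many combinations that do not need this calculation.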
df2
Country Code
0 ABC 32
1 DEF 27
2 GHI 19
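You can process only the filtered DataFrame for each country (or any other key) out of the full data, and then append or merge the pieces back together. I am only presenting the approach here. Feel free to use the code below and adapt it into a more optimized solution.
Code: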
import numpy as np
import pandas as pd

gr = 0.05  # yearly growth rate

# Sample data: three Country/Code groups, ten years each, one known value per group
df2 = pd.DataFrame({
    'Country': np.repeat(['ABC', 'DEF', 'GHI'], 10),
    'Code': np.repeat([32, 27, 19], 10),
    'Year': np.arange(2000, 2010).tolist() * 3,
    'Value': np.nan,
})
df2.loc[[4, 14, 24], 'Value'] = [1000000.0, 2000000.0, 3000000.0]
# print(df2)
df2['imputed'] = 0.0

def process(df):
    # expects a 0..n-1 index and exactly one non-NaN Value
    nx = df.Value.first_valid_index()  # index of first non-NaN value
    nv = df.Value.loc[nx]              # first non-NaN value
    for i in range(len(df)):
        # print(nv, gr, i, nx)
        df.loc[i, 'imputed'] = nv * ((1+gr) ** (i - nx))
    return df

parts = []
for c in df2.Country.unique():
    cond = (df2.Country == c)
    p_df = df2[cond].copy()
    p_df.reset_index(drop=True, inplace=True)
    parts.append(process(p_df))
# DataFrame.append was removed in pandas 2.0; concatenate the pieces instead
new_df = pd.concat(parts, ignore_index=True)
print(new_df)
Output:
Country Code Year Value imputed
0 ABC 32 2000 NaN 8.227025e+05
1 ABC 32 2001 NaN 8.638376e+05
2 ABC 32 2002 NaN 9.070295e+05
3 ABC 32 2003 NaN 9.523810e+05
4 ABC 32 2004 1000000.0 1.000000e+06
5 ABC 32 2005 NaN 1.050000e+06
6 ABC 32 2006 NaN 1.102500e+06
7 ABC 32 2007 NaN 1.157625e+06
8 ABC 32 2008 NaN 1.215506e+06
9 ABC 32 2009 NaN 1.276282e+06
10 DEF 27 2000 NaN 1.645405e+06
11 DEF 27 2001 NaN 1.727675e+06
12 DEF 27 2002 NaN 1.814059e+06
13 DEF 27 2003 NaN 1.904762e+06
14 DEF 27 2004 2000000.0 2.000000e+06
15 DEF 27 2005 NaN 2.100000e+06
16 DEF 27 2006 NaN 2.205000e+06
17 DEF 27 2007 NaN 2.315250e+06
18 DEF 27 2008 NaN 2.431013e+06
19 DEF 27 2009 NaN 2.552563e+06
20 GHI 19 2000 NaN 2.468107e+06
21 GHI 19 2001 NaN 2.591513e+06
22 GHI 19 2002 NaN 2.721088e+06
23 GHI 19 2003 NaN 2.857143e+06
24 GHI 19 2004 3000000.0 3.000000e+06
25 GHI 19 2005 NaN 3.150000e+06
26 GHI 19 2006 NaN 3.307500e+06
27 GHI 19 2007 NaN 3.472875e+06
28 GHI 19 2008 NaN 3.646519e+06
29 GHI 19 2009 NaN 3.828845e+06
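The question also asks how to restrict the calculation to the Country/Code combinations listed in a separate frame and leave the rest alone. One way to do that without a per-group loop is sketched below. It is only a sketch under stated assumptions: the names main, wanted, anchors and out are made up for illustration, the extra XYZ/99 group is invented to stand in for a combination that needs no imputation, and the exponent is taken from the Year difference rather than the index difference (the two agree only when the years are consecutive, as in the sample).

```python
import numpy as np
import pandas as pd

gr = 0.05  # yearly growth rate (same value as in the question)

# Main data: three groups to impute plus one (XYZ/99, hypothetical) to skip.
main = pd.DataFrame({
    'Country': np.repeat(['ABC', 'DEF', 'GHI', 'XYZ'], 10),
    'Code': np.repeat([32, 27, 19, 99], 10),
    'Year': np.arange(2000, 2010).tolist() * 4,
    'Value': np.nan,
})
main.loc[[4, 14, 24, 34], 'Value'] = [1000000.0, 2000000.0, 3000000.0, 4000000.0]

# Combinations that should be imputed (plays the role of the small df2 in the question).
wanted = pd.DataFrame({'Country': ['ABC', 'DEF', 'GHI'], 'Code': [32, 27, 19]})

keys = ['Country', 'Code']

# Each group's single non-NaN row gives its anchor year and anchor value.
anchors = main.dropna(subset=['Value']).rename(
    columns={'Year': 'anchor_year', 'Value': 'anchor_value'})
anchors = anchors[keys + ['anchor_year', 'anchor_value']]

# Keep anchors only for the wanted combinations, then attach them to every row.
anchors = anchors.merge(wanted, on=keys, how='inner')
out = main.merge(anchors, on=keys, how='left')

# Compound growth away from the anchor year; rows of unwanted groups stay NaN.
out['imputed'] = out['anchor_value'] * (1 + gr) ** (out['Year'] - out['anchor_year'])
out = out.drop(columns=['anchor_year', 'anchor_value'])
```

The left merge keeps every row of main in its original order, so groups missing from wanted simply end up with NaN in imputed instead of being dropped.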