Pandas: Impute a given number of missing values before/after a series of available values
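Suppose I have a time series in which data is usually available for a run of consecutive years, but values are missing before and after that span, like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'year': ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007"], 'cakes eaten': [np.nan, np.nan, np.nan, 3, 4, 5, np.nan, np.nan]})
print(df)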
   year  cakes eaten
0  2000          NaN
1  2001          NaN
2  2002          NaN
3  2003          3.0
4  2004          4.0
5  2005          5.0
6  2006          NaN
7  2007          NaN
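Is there a way to fill in a given number of missing values based on the trend seen in the available values?
Say I want to fill at most 2 values in each direction; the result should look like this: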
   year  cakes eaten
0  2000          NaN
1  2001          1.0
2  2002          2.0
3  2003          3.0
4  2004          4.0
5  2005          5.0
6  2006          6.0
7  2007          7.0
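Additionally: is there a way to make sure this imputation only happens when enough values are available? For example, I only want to fill at most 2 values in each direction if at least 3 values are available (or, more generally, fill n only if n + m are available).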
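I would use interpolate(), as mentioned. It offers several methods, and they produce different results. I used the krogh method (which requires SciPy); it yields a linear trend line here because the known values lie on a straight line. limit_direction='both' is needed to fill in the trend in both directions:
test_dict = {'col': [np.nan, np.nan, np.nan, np.nan, np.nan, 4, 5, 6, np.nan]}
df = pd.DataFrame(test_dict)
df['trend'] = df['col'].interpolate(method='krogh', limit_direction='both')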
   col  trend
0  NaN   -1.0
1  NaN    0.0
2  NaN    1.0
3  NaN    2.0
4  NaN    3.0
5  4.0    4.0
6  5.0    5.0
7  6.0    6.0
8  NaN    7.0
Once that is done, you can drop any trend values below 0 that you do not want.
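As a minimal sketch of that last step, assuming "drop" means setting the unwanted values back to NaN rather than removing rows, Series.where can mask everything below 0. The trend values are copied from the output above so the snippet does not need SciPy; the name `cleaned` is only illustrative.

```python
import numpy as np
import pandas as pd

# Trend values copied from the interpolated output shown above
trend = pd.Series([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0], name="trend")

# Series.where keeps entries where the condition is True and puts NaN elsewhere
cleaned = trend.where(trend >= 0)
print(cleaned)
```

If the rows themselves should disappear, `cleaned.dropna()` would remove them instead of leaving NaN placeholders.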
Thanks to @olv1do for showing me that interpolate() does what I want.
Using interpolate() together with .first_valid_index() and .last_valid_index() gives the desired behaviour:
# Impute up to n values in each direction if at least m values are available
def interpolate(data, n, m):
    first_valid = data['cakes eaten'].first_valid_index()
    last_valid = data['cakes eaten'].last_valid_index()
    # Span between the first and last valid index labels; this equals the number
    # of available values only for one contiguous run on an integer index.
    if abs(first_valid - last_valid) + 1 >= m:
        data['imputed'] = data['cakes eaten'].interpolate(method='spline', order=1, limit_direction='both', limit=n)
    return data
The example from the question:
df = pd.DataFrame({'year': ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007"], 'cakes eaten': [np.nan, np.nan, np.nan, 3, 4, 5, np.nan, np.nan]})
interpolate(df, 2, 3)
   year  cakes eaten  imputed
0  2000          NaN      NaN
1  2001          NaN      1.0
2  2002          NaN      2.0
3  2003          3.0      3.0
4  2004          4.0      4.0
5  2005          5.0      5.0
6  2006          NaN      6.0
7  2007          NaN      7.0
If fewer than m values are available, nothing is done:
df = pd.DataFrame({'year': ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007"], 'cakes eaten': [np.nan, np.nan, np.nan, 3, 4, np.nan, np.nan, np.nan]})
interpolate(df, 2, 3)
   year  cakes eaten
0  2000          NaN
1  2001          NaN
2  2002          NaN
3  2003          3.0
4  2004          4.0
5  2005          NaN
6  2006          NaN
7  2007          NaN
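The guard in the function above subtracts index labels, so it matches the number of available values only when they form one contiguous run on a default integer index. A minimal sketch of an alternative check, assuming that simply counting the non-missing values is what is wanted; `has_enough_values` is a hypothetical helper name, not part of pandas.

```python
import numpy as np
import pandas as pd

def has_enough_values(series, m):
    """Return True if the series holds at least m non-missing values."""
    # Series.count() counts non-NaN entries and does not depend on the index type
    return bool(series.count() >= m)

# Same data as the example above, but indexed by the year strings
cakes = pd.Series(
    [np.nan, np.nan, np.nan, 3, 4, np.nan, np.nan, np.nan],
    index=["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007"],
    name="cakes eaten",
)

print(has_enough_values(cakes, 3))  # False: only two values are available
print(has_enough_values(cakes, 2))  # True
```

Note that this counts all available values, including ones separated by gaps, so it is a different condition from the span check if the data is not one contiguous run.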
The spline method also works well when the values are not perfectly linear, unlike in my example:
df = pd.DataFrame({'year': ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007"], 'cakes eaten': [np.nan, np.nan, 1, 4, 2, 3, np.nan, np.nan]})
interpolate(df, 1, 4)
   year  cakes eaten   imputed
0  2000          NaN       NaN
1  2001          NaN  1.381040
2  2002          1.0  1.000000
3  2003          4.0  4.000000
4  2004          2.0  2.000000
5  2005          3.0  3.000000
6  2006          NaN  3.433167
7  2007          NaN       NaN
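The krogh and spline methods of interpolate() both rely on SciPy. For cases where SciPy is not available, here is a rough sketch of the same idea using only NumPy. It swaps in an ordinary least-squares line from numpy.polyfit for the spline, so on non-linear data its results will differ from the output above. The function name `fill_linear_trend` is hypothetical, and the sketch assumes evenly spaced rows and leaves gaps inside the valid run untouched.

```python
import numpy as np
import pandas as pd

def fill_linear_trend(series, n):
    """Fill up to n missing values on each side of the valid run with a fitted line."""
    values = series.to_numpy(dtype=float)
    positions = np.arange(len(values))
    valid = ~np.isnan(values)
    if valid.sum() < 2:
        # A line cannot be fitted through fewer than two points
        return series.copy()

    # Least-squares straight line through the available values
    slope, intercept = np.polyfit(positions[valid], values[valid], deg=1)

    first = positions[valid][0]
    last = positions[valid][-1]
    # Positions at most n steps before the first or after the last valid value
    before = (positions >= first - n) & (positions < first)
    after = (positions > last) & (positions <= last + n)
    fill = before | after

    filled = values.copy()
    filled[fill] = slope * positions[fill] + intercept
    return pd.Series(filled, index=series.index, name=series.name)

cakes = pd.Series([np.nan, np.nan, np.nan, 3, 4, 5, np.nan, np.nan], name="cakes eaten")
result = fill_linear_trend(cakes, n=2)
print(result.round(6))
```

For the data from the question, this should leave the first row as NaN and extend the line by two values in each direction, up to floating-point rounding.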