Filling missing data rows with the pandas reindex function
I am trying to use the pandas reindex function to fill in missing rows in my time series data.
My data looks like this:
100,2007,239,4,29.588,-30.851,-999.0,-999.0,-999.0,-999.00,13.125,-999.00
100,2007,239,5,29.573,-30.843,-999.0,-999.0,-999.0,-999.00,13.126,-999.00
100,2007,239,14,29.389,-30.880,-999.0,-999.0,-999.0,-999.00,13.131,-999.00
100,2007,239,15,29.367,-30.901,-999.0,-999.0,-999.0,-999.00,13.131,-999.00
100,2007,239,24,29.374,-30.920,-999.0,-999.0,-999.0,-999.00,13.135,-999.00
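The fourth column is a time series at one-minute intervals over one day. Unlike a normal time series index, the time index of this data runs 0 to 59, 100 to 159, ..., 2300 to 2359, because a day has 24 hours and an hour has 60 minutes. To fill the gaps with 'nan' values, I wrote the following code: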
S = []
for i in range(0,24):
    s = np.arange(i*100,i*100+60)
    s = list(s)
S = S + s
pd.set_option('max_rows',10)
for INPUT in FileList:
    output = INPUT + "result" # set the output files
    data=pd.read_csv(INPUT,sep=',',index_col=[3],parse_dates=[3])
    index = 'S'#make the reference index to fill
    df = data
    sk_f = df.reindex(index)
    sk_f.to_csv(output,na_rep='nan')
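With this code I intended to add 'nan' rows for the gaps in the fourth-column index, using the reference index S.
But the result contains only 'nan' rows, without the original data, as shown below: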
,100,2007,241,22.471,-31.002,-999.0,-999.0.1,-999.0.2,-999.00,13.294,-999.00 .1
0,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
1,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
2,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
3,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
4,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
5,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
6,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
7,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
8,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
9,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
10,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
11,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
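What I expect is for the missing rows in the original data to be filled in. For example, the original data has no rows for index values 0 to 3, so I want those rows added in the same format as the original data.
I may be missing something.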
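First, I see a problem with the indentation of S = S + s where the list is created. That line must be inside the loop; otherwise the list S keeps only the last s: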
S = []
for i in range(0,24):
    s = np.arange(i*100,i*100+60)
    s = list(s)
S = S + s  # outside the loop: keeps only the last s
Change it to:
S = []
for i in range(0,24):
    s = np.arange(i*100,i*100+60)
    s = list(s)
    S = S + s
Or shorter:
S = []
for i in range(0,24):
    S = S + list(np.arange(i*100,i*100+60))
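As a possible alternative, the same reference index can be built with a list comprehension and no NumPy. This is only a sketch; the helper name build_hhmm_index is made up for illustration and is not part of the original code.

```python
# Build the HHMM-style minute index: 0..59, 100..159, ..., 2300..2359.
# build_hhmm_index is a hypothetical helper name used only for this sketch.
def build_hhmm_index():
    return [hour * 100 + minute for hour in range(24) for minute in range(60)]


S = build_hhmm_index()
print(len(S))      # 1440 entries, one per minute of the day
print(S[58:62])    # [58, 59, 100, 101]
```

The result holds plain Python integers, which should match an integer index read from the CSV.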
The next problem is index = 'S'. I think this is a typo and it should be index = S.
You can also add the bfill() function to backward-fill the gaps:
sk_f = df.reindex(index).bfill()
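To make the difference concrete, here is a small sketch comparing reindex alone with reindex followed by bfill(). The toy frame, the column name value, and the 0 to 15 target index are assumptions chosen for illustration, not the original data.

```python
import pandas as pd

# Toy frame with gaps in its integer index (illustrative values only).
df = pd.DataFrame({"value": [29.588, 29.573, 29.389]}, index=[4, 5, 14])
target = list(range(0, 16))  # hypothetical reference index 0..15

with_nan = df.reindex(target)        # rows missing from df are added as NaN
filled = df.reindex(target).bfill()  # each NaN row takes the next valid row's values

print(with_nan["value"].isna().sum())  # number of rows that were added as NaN
print(filled.loc[[0, 6, 15], "value"])
```

If the goal is rows of 'nan' in the output file, as the question describes, reindex without bfill() followed by to_csv(..., na_rep='nan') should be enough. Note that bfill() leaves rows after the last valid entry as NaN, because there is no later value to copy from.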
Code:
import io

import numpy as np
import pandas as pd

# reference index: 0..59, 100..159, ..., 2300..2359
S = []
for i in range(0,24):
    S = S + list(np.arange(i*100,i*100+60))

#original data
temp=u"""100,2007,239,4,29.588,-30.851,-999.0,-999.0,-999.0,-999.00,13.125,-999.00
100,2007,239,5,29.573,-30.843,-999.0,-999.0,-999.0,-999.00,13.126,-999.00
100,2007,239,14,29.389,-30.880,-999.0,-999.0,-999.0,-999.00,13.131,-999.00
100,2007,239,15,29.367,-30.901,-999.0,-999.0,-999.0,-999.00,13.131,-999.00
100,2007,239,24,29.374,-30.920,-999.0,-999.0,-999.0,-999.00,13.135,-999.00"""
#pd.set_option('max_rows',10)
# header=None: the data has no header row, so the first line is not consumed as column names
# parse_dates is left out because the index holds HHMM-style integers, not dates
data=pd.read_csv(io.StringIO(temp),sep=',', header=None, index_col=3)
data.index.name = None
print(data)
# 0 1 2 4 5 6 7 8 9 10 11
#4 100 2007 239 29.588 -30.851 -999 -999 -999 -999 13.125 -999
#5 100 2007 239 29.573 -30.843 -999 -999 -999 -999 13.126 -999
#14 100 2007 239 29.389 -30.880 -999 -999 -999 -999 13.131 -999
#15 100 2007 239 29.367 -30.901 -999 -999 -999 -999 13.131 -999
#24 100 2007 239 29.374 -30.920 -999 -999 -999 -999 13.135 -999
index = S  # make the reference index to fill
df = data
sk_f = df.reindex(index).bfill()
print(sk_f.head(20))
# 0 1 2 4 5 6 7 8 9 10 11
#0 100 2007 239 29.588 -30.851 -999 -999 -999 -999 13.125 -999
#1 100 2007 239 29.588 -30.851 -999 -999 -999 -999 13.125 -999
#2 100 2007 239 29.588 -30.851 -999 -999 -999 -999 13.125 -999
#3 100 2007 239 29.588 -30.851 -999 -999 -999 -999 13.125 -999
#4 100 2007 239 29.588 -30.851 -999 -999 -999 -999 13.125 -999
#5 100 2007 239 29.573 -30.843 -999 -999 -999 -999 13.126 -999
#6 100 2007 239 29.389 -30.880 -999 -999 -999 -999 13.131 -999
#7 100 2007 239 29.389 -30.880 -999 -999 -999 -999 13.131 -999
#8 100 2007 239 29.389 -30.880 -999 -999 -999 -999 13.131 -999
#9 100 2007 239 29.389 -30.880 -999 -999 -999 -999 13.131 -999
#10 100 2007 239 29.389 -30.880 -999 -999 -999 -999 13.131 -999
#11 100 2007 239 29.389 -30.880 -999 -999 -999 -999 13.131 -999
#12 100 2007 239 29.389 -30.880 -999 -999 -999 -999 13.131 -999
#13 100 2007 239 29.389 -30.880 -999 -999 -999 -999 13.131 -999
#14 100 2007 239 29.389 -30.880 -999 -999 -999 -999 13.131 -999
#15 100 2007 239 29.367 -30.901 -999 -999 -999 -999 13.131 -999
#16 100 2007 239 29.374 -30.920 -999 -999 -999 -999 13.135 -999
#17 100 2007 239 29.374 -30.920 -999 -999 -999 -999 13.135 -999
#18 100 2007 239 29.374 -30.920 -999 -999 -999 -999 13.135 -999
#19 100 2007 239 29.374 -30.920 -999 -999 -999 -999 13.135 -999
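To tie this back to the per-file loop in the question, here is a minimal sketch that reads headerless CSV data, adds 'nan' rows for the missing minutes, and writes the result. The function name fill_missing_minutes and the shortened two-row sample are assumptions for illustration; in real use the source argument would be a file path from FileList.

```python
import io

import pandas as pd


def fill_missing_minutes(source, reference_index):
    """Read a headerless CSV, index it by column 3, and add NaN rows for missing minutes.

    fill_missing_minutes is a hypothetical helper written for this sketch.
    """
    data = pd.read_csv(source, header=None, index_col=3)
    return data.reindex(reference_index)


# Shortened sample standing in for one input file (only the first six columns).
sample = io.StringIO(
    "100,2007,239,4,29.588,-30.851\n"
    "100,2007,239,5,29.573,-30.843\n"
)
reference = [h * 100 + m for h in range(24) for m in range(60)]
result = fill_missing_minutes(sample, reference)

# Write to an in-memory buffer here; a file path would work the same way.
out = io.StringIO()
result.to_csv(out, header=False, na_rep="nan")
lines = out.getvalue().splitlines()
print(lines[0])  # first row of the reference index, which is absent from the sample
print(lines[4])  # row for index 4, which is present in the sample
```

Integer columns are likely to come back as floats after reindexing, because NaN cannot be stored in a plain integer column.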