Pandas get specific rows from HDF5 by index
I have a pandas DataFrame that has been written to an HDF5 file. The data is indexed by timestamp and looks like this:
In [5]: df
Out[5]:
                          Codes   Price  Size
Time
2015-04-27 01:31:08-04:00     T  111.75    23
2015-04-27 01:31:39-04:00     T  111.80    23
2015-04-27 01:31:39-04:00     T  113.00    35
2015-04-27 01:34:14-04:00     T  113.00    85
2015-04-27 01:55:15-04:00     T  113.50   203
...                         ...     ...   ...
2015-05-26 11:35:00-04:00    CA  110.55   196
2015-05-26 11:35:00-04:00    CA  110.55    98
2015-05-26 11:35:00-04:00    CA  110.55   738
2015-05-26 11:35:00-04:00    CA  110.55    19
2015-05-26 11:37:01-04:00        110.55    12
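I would like to create a function that I can pass a pandas DatetimeIndex to, and that will return a DataFrame with the row at or immediately before each timestamp in that DatetimeIndex.
The problem I'm running into is that read_hdf queries won't work if I am looking for more than 30 rows -- see [
What I am doing now is this, but there has to be a better solution: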
from pandas import read_hdf, DatetimeIndex
from datetime import timedelta
import pytz

def getRows(file, dataset, index):
    # Work out the span of whole days covered by the requested timestamps
    if len(index) == 1:
        start = index.date[0]
        end = (index.date + timedelta(days=1))[0]
    else:
        start = index.date.min()
        end = (index.date.max() + timedelta(days=1))
    # Read every row in that date range from the HDF5 file
    where = '(index >= "' + str(start) + '") & (index < "' + str(end) + '")'
    df = read_hdf(file, dataset, where=where)
    # Keep the last row per timestamp, then forward-fill onto the requested index
    df = df.groupby(level=0).last().reindex(index, method='pad')
    return df
Here is an example using a where mask
In [22]: pd.set_option('max_rows',10)
In [23]: df = DataFrame({'A' : np.random.randn(100), 'B' : pd.date_range('20130101',periods=100)}).set_index('B')
In [24]: df
Out[24]:
A
B
2013-01-01 0.493144
2013-01-02 0.421045
2013-01-03 -0.717824
2013-01-04 0.159865
2013-01-05 -0.485890
... ...
2013-04-06 -0.805954
2013-04-07 -1.014333
2013-04-08 0.846877
2013-04-09 -1.646908
2013-04-10 -0.160927
[100 rows x 1 columns]
Store the test frame
In [25]: store = pd.HDFStore('test.h5',mode='w')
In [26]: store.append('df',df)
Select some dates at random.
In [27]: dates = df.index.take(np.random.randint(0,100,10))
In [28]: dates
Out[28]: DatetimeIndex(['2013-03-29', '2013-02-16', '2013-01-15', '2013-02-06', '2013-01-12', '2013-02-24', '2013-02-18', '2013-01-06', '2013-03-17', '2013-03-21'], dtype='datetime64[ns]', name=u'B', freq=None, tz=None)
Select the index column (in full)
In [29]: c = store.select_column('df','index')
In [30]: c
Out[30]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
4 2013-01-05
...
95 2013-04-06
96 2013-04-07
97 2013-04-08
98 2013-04-09
99 2013-04-10
Name: B, dtype: datetime64[ns]
Select the indexers you want. This could actually be somewhat complicated, e.g. you might want a .reindex(method='nearest')
In [34]: c[c.isin(dates)]
Out[34]:
5 2013-01-06
11 2013-01-12
14 2013-01-15
36 2013-02-06
46 2013-02-16
48 2013-02-18
54 2013-02-24
75 2013-03-17
79 2013-03-21
87 2013-03-29
Name: B, dtype: datetime64[ns]
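When the timestamps you are looking for do not match stored rows exactly, isin will not find them. One possible way to build the indexer in that case is Index.get_indexer with a fill method, which returns row numbers directly. The snippet below is only a sketch: the small `c` and `dates` are made-up stand-ins for the stored index column and the query, and it assumes the stored index is sorted and has no duplicate timestamps (get_indexer with a method needs both).

```python
import pandas as pd

# Stand-in for `c = store.select_column('df', 'index')`: the stored index values.
c = pd.Series(pd.date_range("2013-01-01", periods=10, freq="2D"), name="B")

# Hypothetical query timestamps, some of which fall between stored rows.
dates = pd.DatetimeIndex(["2013-01-03", "2013-01-04 12:00", "2013-01-12 06:00"])

stored = pd.DatetimeIndex(c)

# Row number of the last stored timestamp at or before each query.
# A query earlier than every stored timestamp would give -1.
pad_rows = stored.get_indexer(dates, method="pad")

# Row number of the closest stored timestamp in either direction.
nearest_rows = stored.get_indexer(dates, method="nearest")

print(pad_rows)
print(nearest_rows)
```

The resulting row numbers play the same role as `c[c.isin(dates)].index`; dropping any -1 entries and repeated row numbers first is probably the safest way to use them as a where mask.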
Select the rows you want
In [32]: store.select('df',where=c[c.isin(dates)].index)
Out[32]:
A
B
2013-01-06 0.680930
2013-01-12 0.165923
2013-01-15 -0.517692
2013-02-06 -0.351020
2013-02-16 1.348973
2013-02-18 0.448890
2013-02-24 -1.078522
2013-03-17 -0.358597
2013-03-21 -0.482301
2013-03-29 0.343381
In [33]: store.close()
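Putting the steps above together for the "row at or before each timestamp" case in the question might look roughly like the following. This is a hedged sketch rather than a tested recipe: `select_at_or_before` is a hypothetical helper name, it assumes the frame was appended in table format in ascending timestamp order, that `targets` has the same timezone-awareness as the stored index, and that at least one target falls on or after the first stored timestamp. It uses searchsorted instead of isin so that duplicate timestamps (as in the question's data) resolve to the last matching row.

```python
import numpy as np
import pandas as pd


def select_at_or_before(store, key, targets):
    """For each target timestamp, return the last stored row at or before it.

    `store` is an open pandas.HDFStore and `key` names a table-format frame
    whose rows were appended in ascending index order (assumed, not checked).
    Targets earlier than every stored timestamp are dropped from the result.
    """
    targets = pd.DatetimeIndex(targets)

    # Read only the index column from disk, not the whole table.
    stored = pd.DatetimeIndex(store.select_column(key, "index"))

    # side="right" points just past any equal timestamps, so subtracting 1
    # gives the last row at or before each target (-1 when there is none).
    pos = stored.searchsorted(targets, side="right") - 1
    found = pos >= 0

    # Fetch each needed row once, by row number, as in the where-mask example.
    wanted = np.unique(pos[found])
    rows = store.select(key, where=pd.Index(wanted))

    # Repeat rows shared by several targets and label them by target.
    out = rows.iloc[np.searchsorted(wanted, pos[found])]
    return out.set_axis(targets[found], axis=0)
```

A call would then look like `select_at_or_before(store, 'df', dates)` on an open store. Compared with reading a whole date range and reindexing, this reads one column plus only the rows that are actually needed.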