Problems with querying multiindex table in HDF when using data_columns
I'm trying to query a multiindex table in a pandas HDFStore, but the query fails when the where clause refers to both an index level and a data column. This only happens with data_columns=True. Is this expected, and how can I avoid it if I don't want to list the data_columns explicitly?
See the example below; the query does not seem to recognize the index levels as valid references:
import pandas as pd
import numpy as np

file_path = r'D:\test_store.h5'  # raw string so the backslash is not read as an escape
np.random.seed(1234)
pd.set_option('display.max_rows', 4)

# simulate some data
index = pd.MultiIndex.from_product([np.arange(10000, 10200),
                                    pd.date_range('19800101', periods=500)],
                                   names=['id', 'date'])
df = pd.DataFrame(dict(id2=np.random.randint(0, 1000, size=len(index)),
                       w=np.random.randn(len(index))),
                  index=index).reset_index().set_index(['id', 'date'])

# store the data
store = pd.HDFStore(file_path, mode='a', complib='blosc', complevel=9)
store.append('df_dc_None', df, data_columns=None)
store.append('df_dc_explicit', df, data_columns=['id2', 'w'])
store.append('df_dc_True', df, data_columns=True)
store.close()

# query the data
start = '19810201'
print(pd.read_hdf(file_path, 'df_dc_None', where='date>start & id=10000'))
print(pd.read_hdf(file_path, 'df_dc_True', where='id2>500'))
print(pd.read_hdf(file_path, 'df_dc_explicit', where='date>start & id2>500'))

# combining an index level and a data column fails when the table
# was written with data_columns=True
try:
    print(pd.read_hdf(file_path, 'df_dc_True', where='date>start & id2>500'))
except ValueError as err:
    print(err)
This is indeed an interesting question!
I can't explain the following difference: why the index columns are indexed when using data_columns=None (the default, per the docstring of the HDFStore.append method), but not when using data_columns=True:
In [114]: store.get_storer('df_dc_None').table
Out[114]:
/df_dc_None/table (Table(100000,), shuffle, blosc(9)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Int32Col(shape=(1,), dflt=0, pos=1),
"values_block_1": Float64Col(shape=(1,), dflt=0.0, pos=2),
"date": Int64Col(shape=(), dflt=0, pos=3),
"id": Int64Col(shape=(), dflt=0, pos=4)}
byteorder := 'little'
chunkshape := (1820,)
autoindex := True
colindexes := {
"date": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"id": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
In [115]: store.get_storer('df_dc_True').table
Out[115]:
/df_dc_True/table (Table(100000,), shuffle, blosc(9)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Int64Col(shape=(1,), dflt=0, pos=1),
"values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
"id2": Int32Col(shape=(), dflt=0, pos=3),
"w": Float64Col(shape=(), dflt=0.0, pos=4)}
byteorder := 'little'
chunkshape := (1820,)
autoindex := True
colindexes := {
"w": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"id2": Index(6, medium, shuffle, zlib(1)).is_csi=False}
NOTE: compare the colindexes entries in the two outputs above.
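One way to see the difference at a glance, without reading through the full table reprs, is to compare the colindexes dictionaries directly (a sketch; it assumes the store has been reopened, as in the In [114] session above):
for key in ['df_dc_None', 'df_dc_True']:
    # colindexes lists the columns that carry a PyTables index
    print(key, sorted(store.get_storer(key).table.colindexes))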
But with the simple hack below we can "fix" this:
In [116]: store.append('df_dc_all', df, data_columns=df.head(1).reset_index().columns)
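The head(1).reset_index().columns trick simply produces the list of index level names plus the regular column names; the same thing can be spelled out directly (a sketch; 'df_dc_all2' is just an illustrative key):
store.append('df_dc_all2', df,
             data_columns=list(df.index.names) + list(df.columns))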
In [118]: store.get_storer('df_dc_all').table
Out[118]:
/df_dc_all/table (Table(100000,), shuffle, blosc(9)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"id": Int64Col(shape=(), dflt=0, pos=1),
"date": Int64Col(shape=(), dflt=0, pos=2),
"id2": Int32Col(shape=(), dflt=0, pos=3),
"w": Float64Col(shape=(), dflt=0.0, pos=4)}
byteorder := 'little'
chunkshape := (1820,)
autoindex := True
colindexes := {
"w": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"date": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"id": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"id2": Index(6, medium, shuffle, zlib(1)).is_csi=False}
Check:
In [119]: pd.read_hdf(file_path,'df_dc_all', where='date>start & id2>500')
Out[119]:
id2 w
id date
10000 1981-02-02 935 0.245637
1981-02-04 994 0.291287
... ... ...
10199 1981-05-11 680 -0.370745
1981-05-12 812 -0.880742
[10121 rows x 2 columns]
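As a further sanity check (a sketch; it assumes df and start from the snippet above are still in scope), the result can be compared against the same filter applied in memory; dtypes may differ slightly after the HDF round trip, hence check_dtype=False:
# same filter as where='date>start & id2>500'
expected = df[(df['id2'] > 500) & (df.index.get_level_values('date') > start)]
result = pd.read_hdf(file_path, 'df_dc_all', where='date>start & id2>500')
pd.testing.assert_frame_equal(result, expected, check_dtype=False)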