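On-disk indexing of a pandas MultiIndexed HDFStore
To improve performance and reduce memory usage, I am trying to read from a multi-indexed HDFStore created with pandas. The original store is large, but the problem can be reproduced with a similar, much smaller example.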
import pandas as pd

df = pd.DataFrame([0.25, 0.5, 0.75, 1.0],
                  index=['Item0', 'Item1', 'Item2', 'Item3'], columns=['Values'])
# Stack two copies under the outer keys 'Items0' and 'Items1' to get a two-level index
df = pd.concat((df.iloc[:], df.iloc[:]), axis=0, names=['Item', 'N'],
               keys=['Items0', 'Items1'])
df.to_hdf('hdfs.h5', 'df', format='table', mode='w', complevel=9, complib='blosc', data_columns=True)
store = pd.HDFStore('hdfs.h5', mode='r')
store.select('df', where='Item="Items0"')
This should return the rows under that sub-index, but instead it raises an error:
> ValueError: The passed where expression: Item="Items0"
> contains an invalid variable reference
> all of the variable refrences must be a reference to
> an axis (e.g. 'index' or 'columns'), or a data_column
> The currently defined references are: index,iron,columns
The index is:
store['df'].index
> MultiIndex(levels=[['Items0', 'Items1'], ['Item0', 'Item1', 'Item2',
> 'Item3']],
> labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 3, 0, 1, 2, 3]],
> names=['Item', 'N'])
Can anyone explain what might be causing this, or how it should be done correctly?
For me, it works if I remove data_columns=True:
df.to_hdf('hdfs3.h5', 'df', format='table',mode='w',complevel= 9,complib='blosc')
store = pd.HDFStore('hdfs3.h5', mode= 'r')
print (store.select('df','Item="Items0"'))
              Values
Item   N
Items0 Item0    0.25
       Item1    0.50
       Item2    0.75
       Item3    1.00
Try replacing data_columns=True with data_columns=df.columns.tolist().
Demo:
Original MultiIndex DF:
In [2]: df
Out[2]:
              Values
Item   N
Items0 Item0    0.25
       Item1    0.50
       Item2    0.75
       Item3    1.00
Items1 Item0    0.25
       Item1    0.50
       Item2    0.75
       Item3    1.00
Save it to HDF5 using data_columns=df.columns.tolist():
In [3]: df.to_hdf('c:/temp/hdfs.h5','df',format='t',mode='w',complevel=9,complib='blosc',data_columns=df.columns.tolist())
In [4]: df.columns.tolist()
Out[4]: ['Values']
Select from the HDF store:
In [5]: store = pd.HDFStore('c:/temp/hdfs.h5')
The index levels and the Values column are now all indexed and can be used in the where=<query> argument:
In [6]: store.select('df',where='Item="Items0" and Values in [0.5, 1]')
Out[6]:
              Values
Item   N
Items0 Item1     0.5
       Item3     1.0
In [7]: store.select('df',where='N="Item3" and Values in [0.5, 1]')
Out[7]:
              Values
Item   N
Items0 Item3     1.0
Items1 Item3     1.0
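The write-then-query round trip above can be sketched as a self-contained script. This is only an illustration under a few assumptions: the scratch file name demo.h5 is made up, compression is left out for brevity, PyTables is installed, and `==` and `>` are used in the where strings instead of `=` and `in`.

```python
import os
import tempfile

import pandas as pd

# Rebuild the small two-level frame from the question.
base = pd.DataFrame(
    [0.25, 0.5, 0.75, 1.0],
    index=["Item0", "Item1", "Item2", "Item3"],
    columns=["Values"],
)
df = pd.concat([base, base], keys=["Items0", "Items1"], names=["Item", "N"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.h5")  # hypothetical scratch file
    # Pass the column names explicitly instead of data_columns=True.
    df.to_hdf(path, key="df", format="table", mode="w",
              data_columns=df.columns.tolist())
    with pd.HDFStore(path, mode="r") as store:
        # Filter on an index level only.
        by_level = store.select("df", where='Item == "Items0"')
        # Filter on an index level and a data column together.
        by_both = store.select("df", where='N == "Item3" and Values > 0.5')

print(by_level)
print(by_both)
```

Both selections are evaluated against the table on disk, so only the matching rows are loaded into memory.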
Storer info:
In [8]: store.get_storer('df').table
Out[8]:
/df/table (Table(8,), shuffle, blosc(9)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"N": StringCol(itemsize=5, shape=(), dflt=b'', pos=1),
"Item": StringCol(itemsize=6, shape=(), dflt=b'', pos=2),
"Values": Float64Col(shape=(), dflt=0.0, pos=3)}
byteorder := 'little'
chunkshape := (2427,)
autoindex := True
colindexes := {
"Values": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"Item": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"N": Index(6, medium, shuffle, zlib(1)).is_csi=False}
Stored index levels:
In [9]: store.get_storer('df').levels
Out[9]: ['Item', 'N']
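The same inspection can be done programmatically rather than by reading the printed table description. The sketch below assumes a scratch file named inspect.h5 (a made-up name) and relies on the storer's levels attribute and the PyTables colindexes mapping shown in the transcripts above; exact contents may vary between pandas versions.

```python
import os
import tempfile

import pandas as pd

base = pd.DataFrame({"Values": [0.25, 0.5, 0.75, 1.0]},
                    index=["Item0", "Item1", "Item2", "Item3"])
df = pd.concat([base, base], keys=["Items0", "Items1"], names=["Item", "N"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "inspect.h5")  # hypothetical scratch file
    df.to_hdf(path, key="df", format="table", mode="w",
              data_columns=df.columns.tolist())
    with pd.HDFStore(path, mode="r") as store:
        storer = store.get_storer("df")
        # Names of the MultiIndex levels recorded for this table.
        level_names = list(storer.levels)
        # Names of the on-disk columns that carry a PyTables index.
        indexed_cols = sorted(storer.table.colindexes)

print(level_names)
print(indexed_cols)
```

The values are read while the store is still open, because the underlying table node is not available after the file is closed.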
NOTE: if you simply omit the data_columns argument, only the index levels will be indexed in the HDF store, and no other column will be searchable:
Demo:
In [19]: df.to_hdf('c:/temp/NO_data_columns.h5', 'df', format='t',mode='w',complevel=9,complib='blosc')
In [20]: store = pd.HDFStore('c:/temp/NO_data_columns.h5')
In [21]: store.select('df',where='N == "Item3"')
Out[21]:
              Values
Item   N
Items0 Item3     1.0
Items1 Item3     1.0
In [22]: store.select('df',where='N == "Item3" and Values == 1')
---------------------------------------------------------------------------
...
skipped
...
ValueError: The passed where expression: N == "Item3" and Values == 1
contains an invalid variable reference
all of the variable refrences must be a reference to
an axis (e.g. 'index' or 'columns'), or a data_column
The currently defined references are: N,index,Item,columns
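A script that needs to cope with stores written without data_columns can catch this ValueError instead of letting it propagate. The following is a minimal sketch, assuming a scratch file named no_dc.h5 (a made-up name) and no compression.

```python
import os
import tempfile

import pandas as pd

base = pd.DataFrame({"Values": [0.25, 0.5, 0.75, 1.0]},
                    index=["Item0", "Item1", "Item2", "Item3"])
df = pd.concat([base, base], keys=["Items0", "Items1"], names=["Item", "N"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "no_dc.h5")  # hypothetical scratch file
    # No data_columns argument: only the index levels are queryable.
    df.to_hdf(path, key="df", format="table", mode="w")
    with pd.HDFStore(path, mode="r") as store:
        # Querying an index level works.
        level_hits = store.select("df", where='N == "Item3"')
        # Querying the plain Values column is rejected.
        try:
            store.select("df", where='N == "Item3" and Values == 1')
            error_message = None
        except ValueError as exc:
            error_message = str(exc)

print(level_hits)
print(error_message)
```

The error text lists the references the store currently accepts, which is a quick way to see what was written as a data column.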
UPDATE:
What is the real difference in putting data_columns=df.columns.tolist()?
In [18]: fn = r'd:/temp/a.h5'
In [19]: df.to_hdf(fn,'dc_true',data_columns=True,format='t',mode='w',complevel=9,complib='blosc')
In [20]: df.to_hdf(fn,'dc_cols',data_columns=df.columns.tolist(),format='t',complevel=9,complib='blosc')
In [21]: store = pd.HDFStore(fn)
In [22]: store
Out[22]:
<class 'pandas.io.pytables.HDFStore'>
File path: d:/temp/a.h5
/dc_cols frame_table (typ->appendable_multi,nrows->8,ncols->3,indexers->[index],dc->[N,Item,Values])
/dc_true frame_table (typ->appendable_multi,nrows->8,ncols->3,indexers->[index],dc->[Values])
In [23]: store.get_storer('dc_true').table.colindexes
Out[23]:
{
"Values": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
In [24]: store.get_storer('dc_cols').table.colindexes
Out[24]:
{
"Item": Index(6, medium, shuffle, zlib(1)).is_csi=False, # <- missing when `data_columns=True`
"N": Index(6, medium, shuffle, zlib(1)).is_csi=False, # <- missing when `data_columns=True`
"Values": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
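So the difference is in how the index levels are indexed: with data_columns=True the Item and N levels get no index of their own, whereas with data_columns=df.columns.tolist() they do.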
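Since the outputs above come from one particular pandas version, it may be worth checking how your own installation behaves. The sketch below writes both variants and reports which index levels lack an on-disk index; the helper unindexed_levels and the file name compare.h5 are made up for this illustration, and the result for data_columns=True is deliberately not assumed here.

```python
import os
import tempfile

import pandas as pd


def unindexed_levels(path, key):
    """Return the index level names stored under `key` that have no on-disk index."""
    with pd.HDFStore(path, mode="r") as store:
        storer = store.get_storer(key)
        indexed = set(storer.table.colindexes)
        return [name for name in storer.levels if name not in indexed]


base = pd.DataFrame({"Values": [0.25, 0.5, 0.75, 1.0]},
                    index=["Item0", "Item1", "Item2", "Item3"])
df = pd.concat([base, base], keys=["Items0", "Items1"], names=["Item", "N"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "compare.h5")  # hypothetical scratch file
    # First variant creates the file, second one appends a new key to it.
    df.to_hdf(path, key="dc_true", format="table", mode="w",
              data_columns=True)
    df.to_hdf(path, key="dc_cols", format="table",
              data_columns=df.columns.tolist())
    report = {key: unindexed_levels(path, key) for key in ("dc_true", "dc_cols")}

print(report)
```

An empty list for a key means every index level can be used in a where clause for that table.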