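h5py: how to read selected rows of an hdf5 file?
Is it possible to read a given set of rows from an hdf5 file without loading the whole file? I have quite big hdf5 files with loads of datasets; here is an example of what I had in mind to reduce time and memory usage: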
#! /usr/bin/env python
import numpy as np
import h5py
infile = 'field1.87.hdf5'
f = h5py.File(infile,'r')
group = f['Data']
mdisk = group['mdisk'].value
val = 2.*pow(10.,10.)
ind = np.where(mdisk>val)[0]
m = group['mcold'][ind]
print m
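Here ind does not give consecutive rows but rather scattered ones.
The above code fails, although it follows the standard way of slicing an hdf5 dataset. The error message I get is: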
Traceback (most recent call last):
File "./read_rows.py", line 17, in <module>
m = group['mcold'][ind]
File "/cosma/local/Python/2.7.3/lib/python2.7/site-packages/h5py-2.3.1-py2.7-linux-x86_64.egg/h5py/_hl/dataset.py", line 425, in __getitem__
selection = sel.select(self.shape, args, dsid=self.id)
File "/cosma/local/Python/2.7.3/lib/python2.7/site-packages/h5py-2.3.1-py2.7-linux-x86_64.egg/h5py/_hl/selections.py", line 71, in select
sel[arg]
File "/cosma/local/Python/2.7.3/lib/python2.7/site-packages/h5py-2.3.1-py2.7-linux-x86_64.egg/h5py/_hl/selections.py", line 209, in __getitem__
raise TypeError("PointSelection __getitem__ only works with bool arrays")
TypeError: PointSelection __getitem__ only works with bool arrays
I have a sample h5py file:
data = f['data']
# <HDF5 dataset "data": shape (3, 6), type "<i4">
# is arange(18).reshape(3,6)
ind=np.where(data[:]%2)[0]
# array([0, 0, 0, 1, 1, 1, 2, 2, 2], dtype=int32)
data[ind] # getitem only works with boolean arrays error
data[ind.tolist()] # can't read data (Dataset: Read failed) error
The last error is caused by the repeated values in the list.
But indexing with a list that has unique values works fine:
In [150]: data[[0,2]]
Out[150]:
array([[ 0, 1, 2, 3, 4, 5],
[12, 13, 14, 15, 16, 17]])
In [151]: data[:,[0,3,5]]
Out[151]:
array([[ 0, 3, 5],
[ 6, 9, 11],
[12, 15, 17]])
So does indexing with an array, provided the other dimension is sliced appropriately:
In [157]: data[ind[[0,3,6]],:]
Out[157]:
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17]])
In [165]: f['data'][:2,np.array([0,3,5])]
Out[165]:
array([[ 0, 3, 5],
[ 6, 9, 11]])
In [166]: f['data'][[0,1],np.array([0,3,5])]
# error about only one indexing array allowed
So if the indexing is right (unique values, matching the array dimensions), it should work.
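One way to satisfy the "unique values" requirement when the index array contains repeats is to read each distinct row once and re-expand the result in memory. The following is only a sketch: read_rows is a hypothetical helper, not part of h5py, and a NumPy array stands in for the dataset so the snippet runs without an HDF5 file. It assumes the dataset accepts a plain list of increasing, unique row numbers, as in the examples above.

```python
import numpy as np

def read_rows(dataset, indices):
    """Hypothetical helper: fetch rows for arbitrary (repeated, unsorted) indices."""
    # Collapse to sorted, unique indices; `inverse` maps each requested
    # index back to its position in `unique_sorted`.
    unique_sorted, inverse = np.unique(np.asarray(indices), return_inverse=True)
    # One read with a plain list of increasing, unique row numbers.
    rows = dataset[unique_sorted.tolist()]
    # Re-expand in memory to the requested order, duplicates included.
    return rows[inverse]

# NumPy stand-in for the (3, 6) sample dataset; with h5py one would pass
# f['data'] instead (assumption: same list-indexing behaviour as shown above).
data = np.arange(18).reshape(3, 6)
ind = np.where(data[:] % 2)[0]   # contains each row number three times
rows = read_rows(data, ind)
print(rows.shape)
```

Only the distinct rows are requested from the dataset; the duplication happens afterwards on the in-memory array.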
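My simple example doesn't test how much of the array is loaded. The documentation sounds as though elements are selected from the file, rather than the whole array being loaded into memory.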
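If memory is the main concern, the selector column from the question (mdisk) can also be scanned in blocks instead of being read in one go, since slicing a dataset reads only the requested range. This is a rough sketch under stated assumptions: matching_rows and its block parameter are hypothetical names, the selector is assumed to be one-dimensional, and NumPy arrays with made-up values stand in for group['mdisk'] and group['mcold'].

```python
import numpy as np

def matching_rows(selector, threshold, block=100_000):
    """Hypothetical helper: row numbers where selector > threshold, read block by block."""
    hits = []
    for start in range(0, selector.shape[0], block):
        # Slicing reads only this range of rows.
        chunk = selector[start:start + block]
        # Offset the within-chunk positions back to global row numbers.
        hits.append(start + np.nonzero(chunk > threshold)[0])
    if not hits:
        return np.empty(0, dtype=np.intp)
    return np.concatenate(hits)

# Made-up stand-ins for group['mdisk'] and group['mcold'].
mdisk = np.array([1e9, 3e10, 5e9, 2.5e10, 4e10])
mcold = np.arange(5) * 10.0

ind = matching_rows(mdisk, 2e10, block=2)   # increasing and unique by construction
m = mcold[ind.tolist()]                     # a plain list, as in the examples above
print(ind, m)
```

Because the blocks are visited in order, the resulting indices are already increasing and unique, which is the form the list indexing above accepts.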