Fast and efficient way of serializing and retrieving a large number of numpy arrays from HDF5 file

Question

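I have a huge list of numpy arrays, 113287 of them to be specific, each of shape 36 x 2048. In terms of memory, this amounts to 32 GB.

So far, I have serialized these arrays into one giant HDF5 file. The problem now is that retrieving a single array from this hdf5 file takes an extremely long time (more than 10 minutes) per access.

How can I speed this up? This is very important for my implementation, since I have to index into this list several thousand times to feed a deep neural network.

Here is how I index into the hdf5 file: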

In [1]: import h5py
In [2]: hf = h5py.File('train_ids.hdf5', 'r')

In [5]: list(hf.keys())[0]
Out[5]: 'img_feats'

In [6]: group_key = list(hf.keys())[0]

In [7]: hf[group_key]
Out[7]: <HDF5 dataset "img_feats": shape (113287, 36, 2048), type "<f4">


# this is where it takes a very, very long time
In [8]: list(hf[group_key])[-1].shape
Out[8]: (36, 2048)

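Any ideas on how to speed this up? Is there another way of serializing these arrays that gives faster access?

Note: I'm using a Python list because I want the order to be preserved (i.e. to retrieve the arrays in the same order in which I put them when I created the hdf5 file).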

Answer 1

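One approach is to put each sample into its own group and index directly into those groups. I think the conversion takes so long because it tries to load the entire dataset into a list (which it has to read from disk). Reorganizing the h5 file so that it looks like

group
- sample
  - 36 x 2048

may help with indexing speed.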
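A minimal sketch of that layout, using a tiny made-up array instead of the real 36 x 2048 features. The file name `demo_groups.hdf5`, the group name, and the choice of naming each sample after its index are assumptions for illustration; the answer does not specify them. Looking a sample up by its index-based name keeps the original order without relying on how the group iterates its members.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical stand-in data: 10 samples of shape (3, 4) instead of (36, 2048).
samples = [np.full((3, 4), i, dtype="<f4") for i in range(10)]
path = os.path.join(tempfile.mkdtemp(), "demo_groups.hdf5")

# Write each sample as its own dataset, named after its position in the list.
with h5py.File(path, "w") as hf:
    grp = hf.create_group("img_feats")
    for i, sample in enumerate(samples):
        grp.create_dataset(str(i), data=sample)

# Read back a single sample by name; only that dataset is read from disk.
with h5py.File(path, "r") as hf:
    sample_7 = hf["img_feats/7"][()]
    n_samples = len(hf["img_feats"])

print(sample_7.shape, n_samples)
```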

Answer 2

According to Out[7], "img_feats" is one large 3d array of shape (113287, 36, 2048).

Define ds as the dataset (this doesn't load anything):

ds = hf[group_key]

x = ds[0]    # should be a (36, 2048) array

arr = ds[:]   # should load the whole dataset into memory.
arr = ds[:n]   # load a subset, slice

According to h5py-reading-writing-data:

HDF5 datasets re-use the NumPy slicing syntax to read and write to the file. Slice specifications are translated directly to HDF5 “hyperslab” selections, and are a fast and efficient way to access data in the file.

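I don't see any point in wrapping this in list(); that is, in splitting the 3d array into a list of 113287 2d arrays. There's a clean mapping between a 3d dataset in the HDF5 file and a numpy array.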
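As a rough illustration of indexing the dataset directly instead of going through list(), here is a small self-contained sketch. The file name `demo_feats.hdf5` and the (10, 3, 4) shape are made up so the example runs quickly; the dataset name `img_feats` is taken from the question.

```python
import os
import tempfile

import h5py
import numpy as np

# Small stand-in for the (113287, 36, 2048) dataset; sizes are hypothetical.
path = os.path.join(tempfile.mkdtemp(), "demo_feats.hdf5")
data = np.arange(10 * 3 * 4, dtype="<f4").reshape(10, 3, 4)

with h5py.File(path, "w") as hf:
    hf.create_dataset("img_feats", data=data)

with h5py.File(path, "r") as hf:
    ds = hf["img_feats"]   # dataset handle; nothing is read yet
    last = ds[-1]          # reads only the last (3, 4) sample
    first_two = ds[:2]     # reads a contiguous slice of two samples

print(last.shape, first_two.shape)
```

Because the samples stay in one dataset, ds[i] returns them in the order they were written, so no Python list is needed to preserve ordering.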

h5py-fancy-indexing warns that fancy indexing of a dataset is slower; that is, when you try to load, say, the [1, 1000, 3000, 6000] subarrays of that large dataset.
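The sketch below shows what such a fancy-indexed read looks like on a tiny made-up dataset, next to a plain loop over single indices that returns the same data. The file name, shape, and index values are assumptions for illustration. Note that h5py expects the index list to be in increasing order. Which variant is faster on the real file is something to measure rather than assume.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical small dataset standing in for the real one.
path = os.path.join(tempfile.mkdtemp(), "demo_fancy.hdf5")
data = np.arange(10 * 3 * 4, dtype="<f4").reshape(10, 3, 4)

with h5py.File(path, "w") as hf:
    hf.create_dataset("img_feats", data=data)

idx = [1, 4, 7]  # index list in increasing order, as h5py requires

with h5py.File(path, "r") as hf:
    ds = hf["img_feats"]
    fancy = ds[idx]                          # one fancy-indexed read
    looped = np.stack([ds[i] for i in idx])  # one simple read per index

print(fancy.shape, np.array_equal(fancy, looped))
```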

If working with this large dataset is too confusing, you might want to experiment with writing and reading some smaller datasets.
