Saving `dask.array` as hdf5 dataset

Question

I have a dask.array that spans multiple hdf5 files. What I want to do is slice the dataset and store the resulting slice to hdf5. This is roughly what I have tried so far:

In [1]: import dask.array as da

In [3]: import numpy as np

In [5]: xs = da.from_array(np.linspace(0, 10), chunks=10) # could be from hdf5 files

In [7]: import h5py

In [8]: h5f = h5py.File('/tmp/paul/foo.h5')

In [9]: h5f.create_dataset(name='ham', data=xs)
Out[9]: <HDF5 dataset "ham": shape (50,), type "<f8">

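This works fine. However, when I da.concatenate multiple h5py datasets, the create_dataset function seems to freeze (a thread deadlock?). Note that xs may be a (roughly) 10 GB dataset spanning 10 files of 1 GB each.

What would be a sensible way to write xs to an h5py dataset without resorting to da.compute and risking a MemoryError?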

Answer 1

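I suspect that the h5py library is converting your dask array into an in-memory numpy array, which is probably not what you want.

Instead, you probably want the store function (see this section in the documentation)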

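import h5py
import dask.array as da

f = h5py.File('myfile.hdf5', 'a')  # x is your dask array; explicit mode, since recent h5py opens read-only by default
d = f.require_dataset('/data', shape=x.shape, dtype=x.dtype)
da.store(x, d)  # writes chunk by chunk instead of materializing x in memory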

You might also want the to_hdf5 method (see this section in the documentation)

da.to_hdf5('myfile.hdf5', '/x', x)
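
As a rough illustration of the to_hdf5 call above, here is a minimal round-trip sketch. The small linspace array, the slice bounds, and the temporary file path are stand-ins chosen for the example (they are not from the question); in practice x would be the slice of the large, multi-file array.

```python
import os
import tempfile

import dask.array as da
import h5py
import numpy as np

# Stand-in data (an assumption for this sketch): a small chunked array,
# sliced the way the question describes slicing the full dataset.
full = da.from_array(np.linspace(0, 10), chunks=10)
x = full[10:40]

# Hypothetical output location in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "myfile.hdf5")

# Write the dask array to the dataset '/x' chunk by chunk.
da.to_hdf5(path, "/x", x)

# Read the file back with h5py to check what was written.
with h5py.File(path, "r") as f:
    stored_shape = f["/x"].shape
    roundtrip_ok = bool(np.allclose(f["/x"][:], np.linspace(0, 10)[10:40]))
```

The read-back at the end is only a sanity check; for a dataset of around 10 GB you would inspect a small slice rather than load everything.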

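You should take care to chunk your HDF5 dataset appropriately so that it aligns with your dask.array chunking. If you'd rather not think about this yourself, the to_hdf5 method handles it for you.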
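
To make the chunk-alignment point concrete, here is a sketch of the store approach applied to the situation in the question: several h5py datasets concatenated with da.concatenate and written to one output dataset. The file names, the dataset name 'ham' in the source files, the tiny array sizes, and the chunk size of 10 are all assumptions made for the example. The output dataset is created with HDF5 chunks equal to the dask chunk size so that each dask block maps onto whole HDF5 chunks.

```python
import os
import tempfile

import dask.array as da
import h5py
import numpy as np

tmpdir = tempfile.mkdtemp()

# Create two small source files (hypothetical stand-ins for the 1 GB files).
paths = []
for i in range(2):
    p = os.path.join(tmpdir, f"part{i}.h5")
    with h5py.File(p, "w") as f:
        f.create_dataset("ham", data=np.arange(i * 20, (i + 1) * 20, dtype="f8"))
    paths.append(p)

# Wrap each on-disk dataset lazily and concatenate without loading the data.
sources = [h5py.File(p, "r") for p in paths]
parts = [da.from_array(s["ham"], chunks=10) for s in sources]
xs = da.concatenate(parts)

out_path = os.path.join(tmpdir, "out.h5")
with h5py.File(out_path, "w") as out:
    # HDF5 chunk shape matches the dask chunk shape (10,).
    dset = out.require_dataset("/data", shape=xs.shape, dtype=xs.dtype, chunks=(10,))
    # Stream the blocks into the dataset one at a time.
    da.store(xs, dset)

for s in sources:
    s.close()

# Read the result back to confirm contents and chunking.
with h5py.File(out_path, "r") as f:
    result = f["/data"][:]
    stored_chunks = f["/data"].chunks
```

Only a few blocks are held in memory at any moment, which is what avoids the MemoryError risk of calling compute on the whole array.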
