Out of memory when saving a large array with HDF5 (Python, PyTables)

Question

Hi everyone,

I have a Python process that generates matrices. They are stacked on top of each other and saved as a tensor. Here is the code:

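import numpy as np
import tables

h5file = tables.open_file("data/tensor.h5", mode="w", title="tensor")
atom = tables.Atom.from_dtype(np.dtype('int16'))
tensor_shape = (N, 3, MAT_SIZE, MAT_SIZE)
# The post does not show how `tensor` is created; a CArray is assumed here.
tensor = h5file.create_carray(h5file.root, "tensor", atom, tensor_shape)

for i in range(N):
    mat = generate(i)
    tensor[i, :, :] = mat

h5file.close()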

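The problem is that it runs out of memory once it reaches 8 GB. Shouldn't the HDF5 format never run out of memory, for example by moving data from memory to disk when needed?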

Answer 1

When you use PyTables, the HDF5 file is kept in memory until the file is closed (see more here: In-memory HDF5 files).

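I suggest you take a look at the append and flush methods of PyTables, as I think that is exactly what you want. Be aware that flushing the buffer on every loop iteration will significantly reduce the performance of your code, because of the constant I/O it requires.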
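A minimal sketch of the append-and-flush approach, assuming an extendable EArray is an acceptable container for the tensor. The values of N and MAT_SIZE, the FLUSH_EVERY interval, the stand-in generate() and the temporary file path are all hypothetical and only there to make the example runnable:

```python
import os
import tempfile

import numpy as np
import tables

N = 20            # hypothetical number of matrices
MAT_SIZE = 8      # hypothetical matrix side length
FLUSH_EVERY = 5   # hypothetical flush interval (not every iteration)


def generate(i):
    """Hypothetical stand-in for the question's generate()."""
    return np.full((3, MAT_SIZE, MAT_SIZE), i, dtype="int16")


path = os.path.join(tempfile.mkdtemp(), "tensor.h5")

with tables.open_file(path, mode="w", title="tensor") as h5file:
    atom = tables.Atom.from_dtype(np.dtype("int16"))
    # A 0 in the shape marks the extendable dimension of the EArray.
    tensor = h5file.create_earray(
        h5file.root, "tensor", atom=atom,
        shape=(0, 3, MAT_SIZE, MAT_SIZE), expectedrows=N,
    )
    for i in range(N):
        mat = generate(i)
        # append() expects the same rank as the array, so add a leading axis.
        tensor.append(mat[np.newaxis, ...])
        if (i + 1) % FLUSH_EVERY == 0:
            tensor.flush()  # write buffered data out every few iterations

# Reopen read-only to check what ended up on disk.
with tables.open_file(path, mode="r") as h5file:
    stored_shape = tuple(h5file.root.tensor.shape)
    last_value = int(h5file.root.tensor[N - 1, 0, 0, 0])
```

Flushing every few iterations rather than on each one is a trade-off between how much data sits in buffers and how much I/O the loop performs; the right interval depends on the matrix size and is worth measuring.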

Writing the file in chunks (just like when reading data into a DataFrame in pandas) might also interest you; see more here: PyTables optimization
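One way to read "writing in chunks" is to build a small batch of matrices in memory, write it as one slice, and give the array an explicit chunkshape. The sketch below assumes a fixed-size CArray with one matrix stack per chunk; BATCH, the chunk layout, the stand-in generate() and the temporary path are hypothetical choices, not something the answer specifies:

```python
import os
import tempfile

import numpy as np
import tables

N = 20         # hypothetical number of matrices
MAT_SIZE = 8   # hypothetical matrix side length
BATCH = 4      # hypothetical number of matrices written per slice


def generate(i):
    """Hypothetical stand-in for the question's generate()."""
    return np.full((3, MAT_SIZE, MAT_SIZE), i, dtype="int16")


path = os.path.join(tempfile.mkdtemp(), "tensor_chunked.h5")

with tables.open_file(path, mode="w", title="tensor") as h5file:
    tensor = h5file.create_carray(
        h5file.root, "tensor",
        atom=tables.Int16Atom(),
        shape=(N, 3, MAT_SIZE, MAT_SIZE),
        # One matrix stack per HDF5 chunk (an assumed layout).
        chunkshape=(1, 3, MAT_SIZE, MAT_SIZE),
    )
    for start in range(0, N, BATCH):
        stop = min(start + BATCH, N)
        # Only BATCH matrices are held in memory at a time.
        batch = np.stack([generate(i) for i in range(start, stop)])
        tensor[start:stop] = batch
        tensor.flush()
    chunkshape = tuple(tensor.chunkshape)

# Reopen read-only and pull a single matrix stack back.
with tables.open_file(path, mode="r") as h5file:
    first = h5file.root.tensor[0]
    last = h5file.root.tensor[N - 1]
```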

python

hdf5

bigdata

pytables