How to gradually write large amounts of data to memory?

Question

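I am performing a signal-processing task on a large image dataset, converting each image into a large feature vector with a specific structure (number_of_transforms, width, height, depth).

The feature vectors (coefficients in my code) are too large to hold in memory all at once, so I tried writing them to an np.memmap, like this: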

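import numpy as np

# Disk-backed array large enough for every sample's coefficients.
coefficients = np.memmap(
    output_location, dtype=np.float32, mode="w+",
    shape=(n_samples, number_of_transforms, width, height, depth))

for n in range(n_samples):
    image = images[n]
    # Transform one image at a time and write the result straight to the file.
    coefficients_sample = transform(image)
    coefficients[n, :, :, :, :] = coefficients_sample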

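This works for my purposes, but it has one drawback: if I later want to load the coefficients of a particular "run" for analysis (transform has to be tested with different hyperparameters), I have to somehow reconstruct the original shape (number_of_transforms, width, height, depth), which is bound to get messy.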
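To make the drawback concrete, here is a minimal sketch, assuming a small made-up shape and a temporary file (neither is from my real pipeline). A raw np.memmap file stores only the bytes, so reopening it without the shape and dtype yields a flat uint8 view:

```python
import os
import tempfile

import numpy as np

# Hypothetical small shape standing in for
# (n_samples, number_of_transforms, width, height, depth).
shape = (2, 3, 4, 4, 2)
path = os.path.join(tempfile.mkdtemp(), "coefficients.dat")

# Write through a raw memmap; the file holds only the data bytes, no header.
written = np.memmap(path, dtype=np.float32, mode="w+", shape=shape)
written[:] = 1.0
written.flush()
del written

# Reopening without shape/dtype falls back to a flat uint8 view of the file.
raw = np.memmap(path, mode="r")
raw_shape, raw_dtype = raw.shape, raw.dtype

# The original layout has to be supplied again by hand.
restored = np.memmap(path, dtype=np.float32, mode="r", shape=shape)
restored_shape = restored.shape
```

So the shape and dtype of every run would have to be recorded somewhere outside the data file.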

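Is there a cleaner (preferably numpy-compatible) way that lets me preserve the structure and data type of the transform feature vectors while still writing the results of transform to disk intermittently?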

Answer 1

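As @juanpa.arrivillaga pointed out, the only change needed is to use numpy.lib.format.open_memmap instead of np.memmap. It creates a regular .npy file, whose header records the shape and dtype alongside the data: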

coefficients = np.lib.format.open_memmap(
    output_location, dtype=np.float32, mode="w+",
    shape=(n_samples, number_of_transforms, width, height, depth))

Later, retrieve the data (with the correct shape and data type) like this:

# The default mode is "r+"; pass mode="r" for read-only access.
coefficients = np.lib.format.open_memmap(output_location)
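For reference, a self-contained round trip might look like the sketch below. The sizes, the images array, and the body of transform are placeholders made up for illustration; only open_memmap and np.load are real numpy calls.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

# Hypothetical sizes, for illustration only.
n_samples, number_of_transforms, width, height, depth = 4, 3, 8, 8, 2


def transform(image):
    # Placeholder transform: stack scaled copies of the image on a new leading axis.
    return np.stack([image * (k + 1) for k in range(number_of_transforms)])


images = np.ones((n_samples, width, height, depth), dtype=np.float32)
output_location = os.path.join(tempfile.mkdtemp(), "coefficients.npy")

# Write one sample at a time; only the sample being assigned is held in RAM.
coefficients = open_memmap(
    output_location, dtype=np.float32, mode="w+",
    shape=(n_samples, number_of_transforms, width, height, depth))
for n in range(n_samples):
    coefficients[n] = transform(images[n])
coefficients.flush()
del coefficients

# Reopen later: shape and dtype are read from the .npy header.
loaded = open_memmap(output_location, mode="r")

# np.load with mmap_mode gives an equivalent memory-mapped view of the same file.
also_loaded = np.load(output_location, mmap_mode="r")
```

Because the result is an ordinary .npy file, any tool that reads .npy can open it, and np.load(..., mmap_mode="r") avoids pulling the whole array into memory.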

python

numpy

python-3.x

numpy-memmap