Random access in a saved-on-disk numpy array

Question

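I have a large numpy array A of shape (2_000_000, 2000) and dtype float64, which takes 32 GB.

(Or alternatively the same data split into 10 arrays of shape (200_000, 2000); that might be easier to serialize?)

How can we serialize it to disk such that we get fast random read access to any part of the data?

More precisely, I need to be able to read ten thousand windows of shape (16, 2000) from A, each at a random starting index i: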

import random
import numpy as np

L = []
for _ in range(10_000):
    i = random.randint(0, 2_000_000 - 16)
    window = A[i:i+16, :]         # window of A of shape (16, 2000) starting at a random index i
    L.append(window)
WINS = np.concatenate(L)   # intended shape (10_000, 16, 2000) of float64, i.e. ~2.4 GB

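Let's say I only have 8 GB of RAM available for this task; loading the whole 32 GB of A into RAM is completely out of the question.

How can we read such windows from a numpy array serialized on disk? (.h5 format or any other)

Note: the fact that the reads happen at randomized starting indices is important.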

Answer 1

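This example shows how you can use an HDF5 file for the process you describe.

First, create an HDF5 file with a dataset of shape (2_000_000, 2000) and dtype=float64 values. I used variables for the dimensions so you can tinker with it.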

import numpy as np
import h5py
import random

h5_a0, h5_a1 = 2_000_000, 2_000

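with h5py.File('SO_68206763.h5','w') as h5f:
    # h5py defaults to float32 when dtype is omitted, so request float64 explicitly
    dset = h5f.create_dataset('test', shape=(h5_a0, h5_a1), dtype='float64')

    incr = 1_000
    a0 = h5_a0//incr
    # fill the dataset in blocks of a0 rows so the full array never sits in RAM
    for i in range(incr):
        arr = np.random.random(a0*h5_a1).reshape(a0,h5_a1)
        dset[i*a0:i*a0+a0, :] = arr
    print(dset[-1,0:10])  # quick dataset check of values in last row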

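Next, open the file in read mode, read 10_000 random array slices of shape (16, 2_000) and append them to the list L. At the end, convert the list to the array WINS. Note that by default the array will have 2 axes; you need to use .reshape() if you want 3 axes as in your comment (the reshape is also shown).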

with h5py.File('SO_68206763.h5','r') as h5f:
    dset = h5f['test']
    L = []
    ds0, ds1 = dset.shape[0], dset.shape[1]
    for i in range(10_000):
        ir = random.randint(0, ds0 - 16)
        window = dset[ir:ir+16, :]  # window from dset of shape (16, 2000) starting at random index ir
        L.append(window)
    WINS = np.concatenate(L)   # shape (160_000, 2_000) of float64
    print(WINS.shape, WINS.dtype)
    WINS = WINS.reshape(10_000, 16, ds1)   # reshaped to (10_000, 16, 2_000) of float64
    print(WINS.shape, WINS.dtype)

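The procedure above is not memory efficient. You wind up with 2 copies of the randomly sliced data: one in the list L and one in the array WINS. If memory is limited, this could be a problem. To avoid the intermediate copy, read each random slice of data directly into a preallocated array. Doing this simplifies the code and reduces the memory footprint. That method is shown below (WINS2 is a 2-axis array, and WINS3 is a 3-axis array).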

with h5py.File('SO_68206763.h5','r') as h5f:
    dset = h5f['test']
    ds0, ds1 = dset.shape[0], dset.shape[1]
    WINS2 = np.empty((10_000*16,ds1))
    WINS3 = np.empty((10_000,16,ds1))
    for i in range(10_000):
        ir = random.randint(0, ds0 - 16)
        WINS2[i*16:(i+1)*16,:] = dset[ir:ir+16, :]
        WINS3[i,:,:] = dset[ir:ir+16, :]
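
As a rough check on the memory figures in this thread, the arithmetic can be written out as below. This is only a back-of-the-envelope sketch: the variable names are illustrative, it counts raw float64 payload (8 bytes per value) and ignores Python and numpy overhead.

```python
# Back-of-the-envelope memory estimate (float64 = 8 bytes per value).
BYTES_PER_FLOAT64 = 8
n_rows, n_cols = 2_000_000, 2_000     # shape of the full array A
n_windows, window_rows = 10_000, 16   # number and height of the random windows

full_array_bytes = n_rows * n_cols * BYTES_PER_FLOAT64
windows_bytes = n_windows * window_rows * n_cols * BYTES_PER_FLOAT64

print(f"full array A:            {full_array_bytes / 1e9:.2f} GB")
print(f"one copy of the windows: {windows_bytes / 1e9:.2f} GB")
print(f"list L plus array WINS:  {2 * windows_bytes / 1e9:.2f} GB")
```

Note that the example above fills both WINS2 and WINS3 for demonstration, which again holds two copies of the windows. If both layouts are needed, WINS2.reshape(10_000, 16, ds1) returns a view of the same memory, so only one array has to be allocated.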

Answer 2

An alternative to h5py datasets, which I tried and which works, is to use memmap, as suggested in @RyanPepper's comment.

Write the data as raw binary

import numpy as np
with open('a.bin', 'wb') as A:
    for f in range(1000):
        x =  np.random.randn(10*2000).astype('float32').reshape(10, 2000)
        A.write(x.tobytes())
        A.flush()

Later, open it as a `memmap`

A = np.memmap('a.bin', dtype='float32', mode='r').reshape((-1, 2000))
print(A.shape)  # (10000, 2000)
print(A[1234:1234+16, :])  # window
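
To connect this to the task in the question, here is a minimal sketch of reading many random windows from a memmap into a preallocated array. It is an assumption-laden toy, not a benchmark: the file name `a_small.bin`, the reduced sizes, and the variable names are all made up for illustration, and each row is filled with its own row index so the result is easy to verify.

```python
import os
import tempfile

import numpy as np

n_rows, n_cols = 1_000, 20        # small stand-in for (2_000_000, 2000)
n_windows, window_rows = 50, 16   # small stand-in for 10_000 windows of 16 rows

# Hypothetical file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "a_small.bin")

# Row r contains the value r in every column, so windows are easy to check.
data = np.repeat(np.arange(n_rows, dtype=np.float32)[:, None], n_cols, axis=1)
data.tofile(path)  # raw C-order bytes, the same layout as write(x.tobytes())

# Map the file read-only; nothing is loaded until rows are actually indexed.
A = np.memmap(path, dtype=np.float32, mode="r", shape=(n_rows, n_cols))

rng = np.random.default_rng(0)
# Valid start indices are 0 .. n_rows - window_rows (upper bound is exclusive).
starts = rng.integers(0, n_rows - window_rows + 1, size=n_windows)

# Preallocate the result and copy each window straight into it.
WINS = np.empty((n_windows, window_rows, n_cols), dtype=A.dtype)
for k, start in enumerate(starts):
    WINS[k] = A[start:start + window_rows, :]

print(WINS.shape)
```

With the real file, only the shape, dtype, and window count would change; the result array is the only large allocation, since the memmap itself just maps the file.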
