hdf to ndarray in numpy - fast way

Question

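I am looking for a fast way to turn my collection of hdf files into a single numpy array in which each row is a flattened version of an image. What I mean exactly:

Besides other information, my hdf files store images per frame. Each file holds 51 frames of 512x424 images. I have 300+ hdf files, and I want the image pixels stored as one vector per frame, with all frames of all files in a single numpy ndarray. The following picture should help to understand:

What I have so far is a very slow method, and I have no idea how to make it faster. The problem, as far as I understand, is that my final array is called too often: the first files are loaded into the array very fast, but the speed decreases quickly (observed by printing the number of the current hdf file).

My current code: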

import glob
import os

import h5py
import numpy as np

os.chdir(os.path.join(os.getcwd(), "datasets"))

# predefine first row to use vstack later
numpy_data = np.ndarray((1, 217088))

# search for all .hdf5 files
for idx, file in enumerate(glob.glob("*.hdf5")):
    f = h5py.File(file, 'r')
    # load all img data to imgs (=ndarray, but not flattened)
    imgs = f['img']['data'][:]
    f.close()

    # iterate over all frames (51)
    for frame in range(0, imgs.shape[0]):
        print("processing {}/{} (file/frame)".format(idx+1, frame+1))
        data = np.array(imgs[frame].flatten())
        numpy_data = np.vstack((numpy_data, data))

        # delete the dummy first row once a real row has been stored
        if idx == 0 and frame == 0:
            numpy_data = np.delete(numpy_data, 0, 0)

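For additional information: I need this for learning a decision tree. Since my hdf files are bigger than my RAM, I think converting them into a numpy array would save memory and is therefore better suited.

Thanks for every input.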

Answer 1

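I don't think you need to iterate over

imgs = f['img']['data'][:]

and reshape each 2d array. Just reshape the whole thing. If I understand your description right, imgs is a 3d array: (51, 512, 424)

imgs.reshape(51, 512*424)

should be the 2d equivalent.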
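As a rough illustration of the reshape above, here is a minimal sketch that uses a small made-up array in place of `f['img']['data'][:]` (the 4x3 frame size is an assumption chosen to keep the demo tiny, not the real 512x424):

```python
import numpy as np

# Hypothetical stand-in for f['img']['data'][:], with tiny frames.
n_frames, height, width = 51, 4, 3
imgs = np.arange(n_frames * height * width, dtype=np.uint16).reshape(
    n_frames, height, width
)

# -1 lets NumPy infer the row length (height * width).
flat = imgs.reshape(n_frames, -1)
print(flat.shape)

# Each row is the flattened version of the matching frame.
same_row = np.array_equal(flat[7], imgs[7].flatten())

# For a C-contiguous array, reshape returns a view, so no pixels are copied.
is_view = np.shares_memory(flat, imgs)
```

Because the result is a view, the reshape itself costs almost nothing; the time goes into reading the file, not into flattening.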

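If you must loop, don't use vstack (or some variant that builds a bigger array step by step). One, it is slow, and two, it's a pain to clean up the initial 'dummy' entry. Use list appends, and do the stacking once, at the end: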

alist = []
for frame in imgs:
    # one flattened frame per list entry
    alist.append(frame.ravel())
data_array = np.vstack(alist)

vstack (and family) takes a list of arrays as input, so it can work with many at once. List append is much faster when done iteratively.
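To make the difference concrete, here is a small sketch comparing the two patterns on made-up data. `load_block` is a hypothetical helper standing in for "read one file and reshape it to 2d". Growing the array with vstack copies all rows collected so far on every step, while the list version copies each row only once at the end.

```python
import numpy as np

def load_block(file_index, n_frames=51, n_pixels=12):
    """Hypothetical stand-in for reading one file as a (frames, pixels) block."""
    return np.full((n_frames, n_pixels), file_index, dtype=np.uint8)

n_files = 5

# Slow pattern: every vstack allocates a new array and copies all old rows.
grown = load_block(0)
for i in range(1, n_files):
    grown = np.vstack((grown, load_block(i)))

# Faster pattern: appending to a list stores only references,
# and the single vstack at the end copies each block once.
blocks = []
for i in range(n_files):
    blocks.append(load_block(i))
stacked = np.vstack(blocks)

print(stacked.shape)
```

Both variants produce the same array; only the amount of copying differs, which is why the slowdown in the question gets worse with every file.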

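I also doubt that putting everything in one array will help. I don't know how the size of an hdf5 file compares with the size of the loaded array, but I expect them to be of the same order of magnitude. So trying to load all 300 files into memory might not work. What is that, about 3G pixels?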
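The pixel count can be checked with a few lines of arithmetic. The sketch below uses the numbers from the question (300 files, 51 frames, 512x424 pixels); the list of candidate dtypes is an assumption, since the question does not state the pixel type. Note that `np.ndarray((1, 217088))` in the question defaults to float64, so the array built there needs 8 bytes per pixel.

```python
n_files = 300
frames_per_file = 51
pixels_per_frame = 512 * 424  # 217088

n_pixels = n_files * frames_per_file * pixels_per_frame
print(n_pixels)  # 3321446400

# Memory needed for the full array, for a few possible element types.
sizes_gib = {}
for name, bytes_per_pixel in [("uint8", 1), ("uint16", 2), ("float32", 4), ("float64", 8)]:
    sizes_gib[name] = n_pixels * bytes_per_pixel / 2**30
    print("{}: {:.1f} GiB".format(name, sizes_gib[name]))
```

Even at one byte per pixel the full array is several GiB, so converting to numpy does not by itself make the data fit in RAM.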

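For an individual file, h5py has provision for loading chunks of an array that is too large to fit in memory. That suggests the usual problem runs the other way: the file holds more than memory can.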
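A minimal sketch of reading a dataset piece by piece with h5py slicing. The demo writes a small throwaway file that mimics the `img`/`data` layout from the question; the file name, frame size, and batch size of 10 are assumptions made for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small demo file with the same group/dataset names as in the question.
path = os.path.join(tempfile.mkdtemp(), "demo.hdf5")
frames = np.random.randint(0, 255, size=(51, 8, 6), dtype=np.uint8)
with h5py.File(path, "w") as f:
    f.create_group("img").create_dataset("data", data=frames)

batch = 10  # frames per read, chosen arbitrarily
total = 0
with h5py.File(path, "r") as f:
    dset = f["img"]["data"]  # a handle only, no pixel data is read yet
    for start in range(0, dset.shape[0], batch):
        block = dset[start:start + batch]  # reads just these frames
        total += int(block.sum())

print(total)
```

Only one batch is held in memory at a time, so the same loop works for datasets that are larger than RAM.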

Is it possible to load large data directly into numpy int8 array using h5py?

Answer 2

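Do you really want to load all images into RAM instead of using a single HDF5 file? Accessing an HDF5 file can be quite fast if you avoid a few mistakes (unnecessary fancy indexing, an improper chunk-cache size). If you want the numpy way, this would be a possibility: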

import glob
import os

import h5py
import numpy as np

os.chdir(os.path.join(os.getcwd(), "datasets"))
img_per_file = 51

# get all HDF5 files
files = glob.glob("*.hdf5")

# allocate memory for your final array (change the datatype if your images have some other type)
numpy_data = np.empty((len(files)*img_per_file, 217088), dtype=np.uint8)

# Now read all the data
ii = 0
for i in range(0, len(files)):
    f = h5py.File(files[i], 'r')  # open the i-th file
    imgs = f['img']['data'][:]
    f.close()
    numpy_data[ii:ii+img_per_file, :] = imgs.reshape((img_per_file, 217088))
    ii = ii + img_per_file

Writing the data to a single HDF5 file is quite similar:

# File_Name_HDF5_out and Dataset_Name_out are placeholders:
# set them to your output path and dataset name
f_out = h5py.File(File_Name_HDF5_out, 'w')
# create the dataset (change the datatype if your images have some other type)
dset_out = f_out.create_dataset(Dataset_Name_out, (len(files)*img_per_file, 217088), chunks=(1, 217088), dtype='uint8')

# Now read all the data
ii = 0
for i in range(0, len(files)):
    f = h5py.File(files[i], 'r')  # open the i-th file
    imgs = f['img']['data'][:]
    f.close()
    dset_out[ii:ii+img_per_file, :] = imgs.reshape((img_per_file, 217088))
    ii = ii + img_per_file

f_out.close()

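If you only want to access whole images afterwards, the chunk size should be okay (one chunk holds exactly one flattened image). If not, you have to change it to fit your needs.

What you should do when accessing an HDF5 file:

Use a chunk size that fits your needs.
Set a proper chunk-cache size. This can be done with the h5py low-level API or with h5py_cache. https://pypi.python.org/pypi/h5py-cache/1.0
Avoid any type of fancy indexing. If your dataset has n dimensions, access it in a way that the returned array also has n dimensions.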

# Chunk size is [50,50] and we iterate over the first dimension
numpyArray=h5_dset[i,:] #slow
numpyArray=np.squeeze(h5_dset[i:i+1,:]) #does the same but is much faster
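As a sketch of the chunk and cache advice: recent h5py releases (2.9 and later) accept the chunk-cache size directly in `h5py.File` through the `rdcc_nbytes` keyword, so the separate h5py_cache package is not strictly needed there. The file name, dataset name, and the 64 MiB cache size below are assumptions for the demo, not tuned recommendations.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "chunked.hdf5")
n_rows, n_cols = 20, 64
data = np.arange(n_rows * n_cols, dtype=np.float32).reshape(n_rows, n_cols)

# One chunk per row, matching the "one image per chunk" layout used above.
with h5py.File(path, "w") as f:
    f.create_dataset("data", data=data, chunks=(1, n_cols))

# rdcc_nbytes sets the raw chunk cache size in bytes (assumes h5py >= 2.9).
with h5py.File(path, "r", rdcc_nbytes=64 * 1024**2) as f:
    dset = f["data"]
    chunk_shape = dset.chunks
    # Slice so the result keeps n dimensions, then drop the length-1 axis.
    row = np.squeeze(dset[3:4, :])

print(chunk_shape, row.shape)
```

Whether the slicing form is noticeably faster than `dset[i, :]` may depend on the h5py version, so it is worth timing both on your own data.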

Edit: This shows how to read your data into a memory-mapped numpy array. I think your method expects data in np.float32 format. https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html#numpy.memmap

numpy_data = np.memmap('Your_Data.npy', dtype=np.float32, mode='w+', shape=(len(files)*img_per_file, 217088))

Everything else can stay the same. If it works, I would also recommend using an SSD instead of a hard disk.
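A small self-contained sketch of the memmap approach, with made-up blocks in place of the HDF5 reads (the file name and the tiny 12-pixel rows are assumptions for the demo). One caveat: a file created by `np.memmap` is raw data without a header, so despite a `.npy` extension it cannot be opened with `np.load`, and dtype and shape must be supplied again when reopening.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "frames.dat")
n_files, img_per_file, n_pixels = 4, 51, 12
shape = (n_files * img_per_file, n_pixels)

# mode='w+' creates the file on disk and maps it for reading and writing.
mm = np.memmap(path, dtype=np.float32, mode="w+", shape=shape)
for i in range(n_files):
    # Stand-in for one file's frames, already reshaped to (frames, pixels).
    block = np.full((img_per_file, n_pixels), i, dtype=np.float32)
    mm[i * img_per_file:(i + 1) * img_per_file] = block
mm.flush()  # make sure everything is written to disk
del mm

# Reopen read-only; dtype and shape are not stored in the file.
ro = np.memmap(path, dtype=np.float32, mode="r", shape=shape)
print(ro.shape, ro[img_per_file, 0])
```

If a self-describing file is preferred, `np.lib.format.open_memmap` writes a real `.npy` header and can be used in a similar way.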

python

numpy

hdf5

h5py