How to append data to one specific dataset in a hdf5 file with h5py

Question

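I am looking for a way to append data to an existing dataset inside a .h5 file using Python (h5py).

A short introduction to my project: I am trying to train a CNN on medical image data. Because of the huge amount of data and the heavy memory usage while converting it to NumPy arrays, I need to split the "transformation" into several chunks: load and preprocess the first 100 medical images and save the NumPy arrays to an hdf5 file, then load the next 100 images and append them to the existing .h5 file, and so on.

For now, I store the first 100 transformed NumPy arrays as follows: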

import h5py
from LoadIPV import LoadIPV

X_train_data, Y_train_data, X_test_data, Y_test_data = LoadIPV()

with h5py.File(r'.\PreprocessedData.h5', 'w') as hf:  # raw string: '\P' is not a valid escape sequence
    hf.create_dataset("X_train", data=X_train_data, maxshape=(None, 512, 512, 9))
    hf.create_dataset("X_test", data=X_test_data, maxshape=(None, 512, 512, 9))
    hf.create_dataset("Y_train", data=Y_train_data, maxshape=(None, 512, 512, 1))
    hf.create_dataset("Y_test", data=Y_test_data, maxshape=(None, 512, 512, 1))

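As can be seen, the transformed NumPy arrays are split into four different "groups", which are stored in the four hdf5 datasets [X_train, X_test, Y_train, Y_test]. The LoadIPV() function performs the preprocessing of the medical image data.

My problem is that I would like to store the next 100 NumPy arrays in the same .h5 file, in the existing datasets: that is, I would like to append the next 100 NumPy arrays to the existing X_train dataset of shape [100, 512, 512, 9], so that X_train ends up with shape [200, 512, 512, 9]. The same should work for the other three datasets X_test, Y_train and Y_test.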

Answer 1

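I found a solution that seems to work!

Have a look at this: incremental writes to hdf5 with h5py!

To append data to a specific dataset, you first have to resize that dataset along the corresponding axis and then write the new data into the newly added space at the end of the "old" array.

Thus, the solution looks like this: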

with h5py.File(r'.\PreprocessedData.h5', 'a') as hf:  # raw string: '\P' is not a valid escape sequence
    hf["X_train"].resize((hf["X_train"].shape[0] + X_train_data.shape[0]), axis = 0)
    hf["X_train"][-X_train_data.shape[0]:] = X_train_data

    hf["X_test"].resize((hf["X_test"].shape[0] + X_test_data.shape[0]), axis = 0)
    hf["X_test"][-X_test_data.shape[0]:] = X_test_data

    hf["Y_train"].resize((hf["Y_train"].shape[0] + Y_train_data.shape[0]), axis = 0)
    hf["Y_train"][-Y_train_data.shape[0]:] = Y_train_data

    hf["Y_test"].resize((hf["Y_test"].shape[0] + Y_test_data.shape[0]), axis = 0)
    hf["Y_test"][-Y_test_data.shape[0]:] = Y_test_data
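The four resize-and-assign pairs above all follow the same pattern, so they could be folded into a small helper. The following is only a sketch: append_rows is a hypothetical name (not part of h5py), the array shapes are shrunk to toy sizes, and an in-memory file (driver="core", backing_store=False) is used so nothing is written to disk.

```python
import h5py
import numpy as np


def append_rows(dset, new_rows):
    """Hypothetical helper: grow `dset` along axis 0 and write `new_rows` there."""
    n_new = new_rows.shape[0]
    if n_new == 0:
        return  # nothing to append
    old_len = dset.shape[0]
    dset.resize(old_len + n_new, axis=0)
    # Write into the rows that were just added.
    dset[old_len:old_len + n_new] = new_rows


# In-memory file so the sketch leaves nothing on disk.
with h5py.File("append_demo.h5", "w", driver="core", backing_store=False) as hf:
    first = np.zeros((3, 4, 4, 2), dtype="float32")
    # The first axis must be unlimited (None) for resize to work.
    hf.create_dataset("X_train", data=first, maxshape=(None, 4, 4, 2))

    second = np.ones((2, 4, 4, 2), dtype="float32")
    append_rows(hf["X_train"], second)

    final_shape = hf["X_train"].shape
    last_rows = hf["X_train"][-2:]

print(final_shape)
```

With such a helper, the four datasets can be handled in a loop over (name, array) pairs instead of repeating the two lines for each one.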

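Note, however, that the dataset has to be created with a maxshape whose first entry is None. maxshape must have the same number of dimensions as the data, so maxshape=(None,) only fits one-dimensional data; for the X_train dataset from the question it would be, for example

h5f.create_dataset('X_train', data=orig_data, compression="gzip", chunks=True, maxshape=(None, 512, 512, 9))

Otherwise the dataset cannot be extended.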
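Whether an existing dataset can be extended can be read off its maxshape and chunks attributes. A minimal sketch, assuming two toy datasets named "fixed" and "growable" (made-up names) in an in-memory file:

```python
import h5py
import numpy as np

data = np.zeros((10, 4))

# In-memory file so the sketch leaves nothing on disk.
with h5py.File("maxshape_demo.h5", "w", driver="core", backing_store=False) as hf:
    # No maxshape: the dataset is stored contiguously with a fixed size.
    fixed = hf.create_dataset("fixed", data=data)
    # Unlimited first axis: h5py turns on chunking automatically.
    growable = hf.create_dataset("growable", data=data, maxshape=(None, 4))

    fixed_maxshape = fixed.maxshape
    fixed_chunks = fixed.chunks
    growable_maxshape = growable.maxshape
    growable_is_chunked = growable.chunks is not None

    # Only the growable dataset can be extended along axis 0.
    growable.resize(25, axis=0)
    grown_shape = growable.shape

print(fixed_maxshape, fixed_chunks)
print(growable_maxshape, growable_is_chunked, grown_shape)
```

A maxshape equal to the current shape, or chunks being None, indicates that the dataset was created with a fixed size and would have to be recreated before it can grow.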

Answer 2

@Midas.Inc's answer works great. Just to provide a minimal working example for those who are interested:

import numpy as np
import h5py

f = h5py.File('MyDataset.h5', 'a')
for i in range(10):

  # Data to be appended
  new_data = np.ones(shape=(100,64,64)) * i
  new_label = np.ones(shape=(100,1)) * (i+1)

  if i == 0:
    # Create the dataset at first
    f.create_dataset('data', data=new_data, compression="gzip", chunks=True, maxshape=(None,64,64))
    f.create_dataset('label', data=new_label, compression="gzip", chunks=True, maxshape=(None,1)) 
  else:
    # Append new data to it
    f['data'].resize((f['data'].shape[0] + new_data.shape[0]), axis=0)
    f['data'][-new_data.shape[0]:] = new_data

    f['label'].resize((f['label'].shape[0] + new_label.shape[0]), axis=0)
    f['label'][-new_label.shape[0]:] = new_label

  print("I am on iteration {} and 'data' chunk has shape:{}".format(i,f['data'].shape))

f.close()

Output of the code:

#I am on iteration 0 and 'data' chunk has shape:(100, 64, 64)
#I am on iteration 1 and 'data' chunk has shape:(200, 64, 64)
#I am on iteration 2 and 'data' chunk has shape:(300, 64, 64)
#I am on iteration 3 and 'data' chunk has shape:(400, 64, 64)
#I am on iteration 4 and 'data' chunk has shape:(500, 64, 64)
#I am on iteration 5 and 'data' chunk has shape:(600, 64, 64)
#I am on iteration 6 and 'data' chunk has shape:(700, 64, 64)
#I am on iteration 7 and 'data' chunk has shape:(800, 64, 64)
#I am on iteration 8 and 'data' chunk has shape:(900, 64, 64)
#I am on iteration 9 and 'data' chunk has shape:(1000, 64, 64)
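One thing to keep in mind with the example above: the i == 0 branch decides whether to create the dataset, so running the script a second time against the same file would try to create 'data' again. A possible variation is to check whether the dataset already exists in the file. This is only a sketch: create_or_append is a hypothetical name, the arrays are smaller than in the answer, and an in-memory file stands in for MyDataset.h5.

```python
import h5py
import numpy as np


def create_or_append(h5file, name, block):
    """Hypothetical helper: create `name` from `block`, or append along axis 0."""
    if name not in h5file:
        # First block: unlimited first axis, other axes fixed to the block's shape.
        h5file.create_dataset(
            name, data=block, compression="gzip", chunks=True,
            maxshape=(None,) + block.shape[1:],
        )
        return
    dset = h5file[name]
    start = dset.shape[0]
    dset.resize(start + block.shape[0], axis=0)
    dset[start:start + block.shape[0]] = block


# In-memory file so the sketch leaves nothing on disk.
with h5py.File("create_or_append_demo.h5", "w", driver="core", backing_store=False) as f:
    for i in range(3):
        create_or_append(f, "data", np.ones((100, 8, 8)) * i)
        create_or_append(f, "label", np.ones((100, 1)) * (i + 1))

    data_shape = f["data"].shape
    label_shape = f["label"].shape
    # Each block of 100 rows should hold the value of its iteration index.
    block_means = [float(f["data"][k * 100:(k + 1) * 100].mean()) for k in range(3)]

print(data_shape, label_shape, block_means)
```

With a file on disk opened in 'a' mode, the same function would keep appending across separate runs instead of failing on the second create_dataset call.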

python

numpy

hdf5

h5py

deep-learning