Randomly read elements from a .h5 file without loading the whole matrix
I have a huge training dataset that does not fit in RAM. I am trying to load random batches of images into a stack without loading the whole .h5 file. My approach is to create a list of indices and shuffle it, instead of shuffling the entire .h5 file.
Say:
import numpy as np
import h5py

a = np.arange(2000 * 2000 * 2000).reshape(2000, 2000, 2000)
idx = np.random.randint(2000, size=800)  # so that I only need to shuffle this idx at the end of an epoch

# create this huge dataset (32 GB > my RAM)
with h5py.File('./tmp.h5', 'w') as f:
    tmp = f.create_dataset('a', (2000, 2000, 2000))
    tmp[:] = a

# read it
with h5py.File('./tmp.h5', 'r') as f:
    tensor = f['a'][:][idx]  # without [:] this raises an error; with it the whole file is loaded into RAM, which I don't want
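Note that the setup above also builds the full array a in memory before writing it, which already exceeds RAM on its own. A minimal, purely illustrative sketch of filling the dataset one 2000x2000 slice at a time instead, assuming a single slice fits comfortably in memory:

import numpy as np
import h5py

# write the same values slice by slice so the full 2000x2000x2000 array
# never has to exist in RAM at once
with h5py.File('./tmp.h5', 'w') as f:
    dset = f.create_dataset('a', (2000, 2000, 2000), dtype='float32')
    for i in range(2000):
        # same values as a[i] in the example above, built one slice at a time
        dset[i] = np.arange(i * 2000 * 2000, (i + 1) * 2000 * 2000).reshape(2000, 2000)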
Does anyone have a solution?
Thanks to @max9111, here is the workaround I suggest:
batch_size = 100
idx = np.arange(2000)

# shuffle in place; np.random.shuffle returns None, so do not reassign idx
np.random.shuffle(idx)
h5py's fancy indexing requires that "Selection coordinates must be given in increasing order", so sort each batch of indices before reading:
for step in range(epoch_len // batch_size):
    # slice out the next batch of shuffled indices and sort them for h5py
    batch_idx = np.sort(idx[step * batch_size:(step + 1) * batch_size])
    with h5py.File(path, 'r') as f:
        img, label = f['img'][batch_idx], f['label'][batch_idx]
    # the remainder that does not fill a full batch is dropped by the integer division
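Putting the pieces together, here is a minimal sketch of a batch generator built on the idea above. The name make_batches and the parameters path, epoch_len and batch_size are hypothetical, and it assumes the file contains 'img' and 'label' datasets with epoch_len entries each; it reads only one sorted batch per step and reshuffles idx at the end of every epoch, so the full file never has to be loaded:

import numpy as np
import h5py

def make_batches(path, epoch_len, batch_size):
    # hypothetical helper: yields (img, label) batches from the .h5 file,
    # reading only batch_size rows per step
    idx = np.arange(epoch_len)
    while True:
        np.random.shuffle(idx)  # shuffle only the index list, never the file
        for step in range(epoch_len // batch_size):  # the remainder is dropped
            batch_idx = np.sort(idx[step * batch_size:(step + 1) * batch_size])
            with h5py.File(path, 'r') as f:
                img, label = f['img'][batch_idx], f['label'][batch_idx]
            yield img, label

Each iteration opens the file, reads a single sorted batch into memory, and closes the file again, so memory usage stays at roughly one batch regardless of the dataset size.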