Saving to hdf5 is very slow (Python freezing)

Question

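I am trying to save bottleneck values to a newly created hdf5 file. The bottleneck values come in batches of shape (120,10,10,2048). Saving a single batch takes up more than 16 GB, and Python seems to freeze on that one batch. Based on recent findings (see the update), hdf5 taking up a lot of memory seems to be okay, but the freezing seems to be a glitch.

I am only trying to save the first 2 batches for test purposes, and only the training data set (once again, this is a test run), but I can't even get past the first batch. It just stalls at the first batch and doesn't loop to the next iteration. If I try to check the hdf5 file, Explorer gets sluggish and Python freezes. If I try to kill Python (even without checking the hdf5 file), Python doesn't close properly and it forces a restart.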

Here is the relevant code and data:

Total data points are about 90,000, released in batches of 120.

Bottleneck shape is (120,10,10,2048)

So the first batch I'm trying to save is (120,10,10,2048)

Here is how I tried to save the dataset:

with h5py.File(hdf5_path, mode='w') as hdf5:
    hdf5.create_dataset("train_bottle", train_shape, np.float32)
    hdf5.create_dataset("train_labels", (len(train.filenames), params['bottle_labels']), np.uint8)
    hdf5.create_dataset("validation_bottle", validation_shape, np.float32)
    hdf5.create_dataset("validation_labels",
                        (len(valid.filenames), params['bottle_labels']), np.uint8)

    # this first part above works fine

    current_iteration = 0
    print('created_datasets')
    for x, y in train:

        number_of_examples = len(train.filenames) # number of images
        prediction = model.predict(x)
        labels = y
        print(prediction.shape) # (120,10,10,2048)
        print(y.shape) # (120, 12)
        print('start', current_iteration*params['batch_size']) # 0
        print('end', (current_iteration+1) * params['batch_size']) # 120

        hdf5["train_bottle"][current_iteration*params['batch_size']: (current_iteration+1) * params['batch_size'],...] = prediction
        hdf5["train_labels"][current_iteration*params['batch_size']: (current_iteration+1) * params['batch_size'],...] = labels
        current_iteration += 1
        print(current_iteration)
        if current_iteration == 3:
            break

Here is the output of the print statements:

(90827, 10, 10, 2048) # print(train_shape)

(6831, 10, 10, 2048)  # print(validation_shape)
created_datasets
(120, 10, 10, 2048)  # print(prediction.shape)
(120, 12)           #label.shape
start 0             #start of batch
end 120             #end of batch

# Just stalls here instead of printing `print(current_iteration)`

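It just stalls here for a while (20+ minutes), and the hdf5 file slowly grows in size (around 20 GB now, before I force kill). Actually, I can't even force kill with Task Manager; I have to restart the OS to actually kill Python in this case.

Update

After playing around with my code for a bit, there seems to be a strange bug/behavior.

The relevant part is here: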

hdf5["train_bottle"][current_iteration*params['batch_size']: (current_iteration+1) * params['batch_size'],...] = prediction
hdf5["train_labels"][current_iteration*params['batch_size']: (current_iteration+1) * params['batch_size'],...] = labels

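If I run either one of these lines on its own, my script goes through the iterations and breaks automatically as expected. So there is no freeze if I run one or the other. It also happens fairly quickly, in less than a minute.

If I run the first line ('train_bottle'), memory usage is about 69-72 GB, even if it's only a couple of batches. If I try more batches, the memory is the same. So I assume train_bottle decides storage based on the size parameters I assign to the dataset, not on when it actually gets filled. So despite the 72 GB, it runs fairly quickly (one minute).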
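A rough back-of-the-envelope check is consistent with that figure. This is only a sketch: it assumes 4 bytes per float32 element and that the dataset is stored uncompressed at its full declared shape; the helper name nbytes is made up for illustration.

```python
from math import prod

# Shapes taken from the question; 4 bytes per element assumes np.float32.
train_shape = (90827, 10, 10, 2048)
batch_shape = (120, 10, 10, 2048)
BYTES_PER_FLOAT32 = 4


def nbytes(shape, itemsize=BYTES_PER_FLOAT32):
    """Uncompressed size in bytes of an array with the given shape."""
    return prod(shape) * itemsize


full_dataset_bytes = nbytes(train_shape)
single_batch_bytes = nbytes(batch_shape)

print(f"full train_bottle: {full_dataset_bytes / 2**30:.1f} GiB")
print(f"one batch:         {single_batch_bytes / 2**20:.2f} MiB")
```

The full dataset comes out at roughly 69.3 GiB (about 74.4 GB), while a single batch is only about 94 MiB, so the large number appears to track the declared dataset shape rather than the amount of data written so far.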

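If I run the second line, train_labels, memory usage is a few megabytes. There is no problem with the iterations, and the break statement is executed.

However, here is the problem: if I try to run both lines (which in my case is necessary, as I need to save both 'train_bottle' and 'train_labels'), I get a freeze on the first iteration, and it doesn't continue to the second iteration, even after 20 minutes. The hdf5 file grows slowly, but if I try to access it, Windows Explorer slows to a crawl and I can't close Python; I have to restart the OS.

So I am not sure what the problem is when running both lines: if I run only the memory-hungry 'train_bottle' line, it works fine and finishes within a minute.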

Answer 1

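Writing Data to HDF5

If you write to a chunked dataset without specifying a chunk shape, h5py will choose one automatically for you. Since h5py cannot know how you want to write or read the data from the dataset, this will often result in bad performance.

You also use the default chunk-cache-size of 1 MB. If you only write to a part of a chunk and the chunk doesn't fit in the cache (which is very likely with a 1 MB chunk-cache-size), the whole chunk will be read into memory, modified, and written back to disk. If that happens multiple times, you will see performance far below the sequential IO-speed of your HDD/SSD.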
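To make the cache argument concrete, here is a small sketch that compares the uncompressed size of one chunk with the 1 MB default cache and counts how many chunks a batch write overlaps. The chunk shape (10, 10, 10, 2048) is the one used in the example further down; the helper names chunk_nbytes and chunks_touched are made up for illustration and are not part of h5py.

```python
from math import prod

DEFAULT_CHUNK_CACHE_BYTES = 1024**2  # the 1 MB default mentioned above
FLOAT32_BYTES = 4


def chunk_nbytes(chunk_shape, itemsize=FLOAT32_BYTES):
    """Uncompressed size in bytes of a single chunk."""
    return prod(chunk_shape) * itemsize


def chunks_touched(start, stop, chunk_rows):
    """Number of chunks along the first axis that rows [start, stop) overlap."""
    first = start // chunk_rows
    last = (stop - 1) // chunk_rows
    return last - first + 1


chunk_shape = (10, 10, 10, 2048)
size = chunk_nbytes(chunk_shape)
fits_in_default_cache = size <= DEFAULT_CHUNK_CACHE_BYTES
print(size, fits_in_default_cache)

# A 120-row batch starting at row 0 covers exactly 12 whole chunks.
aligned = chunks_touched(0, 120, chunk_rows=10)
# The same batch shifted by 5 rows overlaps 13 chunks, two of them only partially.
shifted = chunks_touched(5, 125, chunk_rows=10)
print(aligned, shifted)
```

With this chunk shape a single chunk is about 8 MB, so it does not fit in a 1 MB cache; any write that covers only part of a chunk would then trigger the read-modify-write cycle described above.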

In the following example I assume that you only read or write along your first dimension. If not, this has to be modified to your needs.

import numpy as np
import tables #register blosc
import h5py as h5
import h5py_cache as h5c
import time

batch_size=120
train_shape=(90827, 10, 10, 2048)
hdf5_path='Test.h5'
# As we are writing whole chunks here this isn't really needed.
# If you forget to set a large enough chunk-cache-size when not writing or reading
# whole chunks, the performance will be extremely bad (chunks can only be read or written as a whole).
f = h5c.File(hdf5_path, 'w',chunk_cache_mem_size=1024**2*200) #200 MB cache size
dset_train_bottle = f.create_dataset("train_bottle", shape=train_shape,dtype=np.float32,chunks=(10, 10, 10, 2048),compression=32001,compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)
prediction=np.array(np.arange(120*10*10*2048),np.float32).reshape(120,10,10,2048)
t1=time.time()
#Testing with 2GB of data
for i in range(20):
    #prediction=np.array(np.arange(120*10*10*2048),np.float32).reshape(120,10,10,2048)
    dset_train_bottle[i*batch_size:(i+1)*batch_size,:,:,:]=prediction

f.close()
elapsed = time.time() - t1  # measure once so both prints use the same value
print(elapsed)
print("MB/s: " + str(2000 / elapsed))

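Edit: The data creation in the loop took quite a lot of time, so I create the data before the time measurement.

This should give at least 900 MB/s throughput (CPU limited). With real data and lower compression ratios, you should easily reach the sequential IO-speed of your hard disk.

Opening an HDF5 file with the with statement can also lead to bad performance if you make the mistake of executing this block multiple times. This would close and reopen the file each time, deleting the chunk-cache.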
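As a hedged illustration of that point, the sketch below opens the file once and keeps the whole write loop inside a single with block. It is not the asker's code: the sizes are small stand-ins so it runs quickly, random arrays replace model.predict(x) and the labels, and the chunk shape of one batch per chunk is an assumption that only suits access along the first axis. Instead of h5py_cache it uses the rdcc_nbytes keyword that h5py.File itself accepts (h5py 2.9 and later) to enlarge the chunk cache.

```python
import os
import tempfile

import h5py
import numpy as np

# Small stand-in sizes; the question uses batches of (120, 10, 10, 2048).
batch_size, n_batches, n_labels = 12, 3, 12
feature_shape = (4, 4, 32)
n_examples = batch_size * n_batches

path = os.path.join(tempfile.mkdtemp(), "bottleneck_demo.h5")

# Open the file once, outside the loop, so the chunk cache is kept.
# rdcc_nbytes sets the chunk cache size in bytes (here 200 MB).
with h5py.File(path, "w", rdcc_nbytes=200 * 1024**2) as f:
    bottle = f.create_dataset(
        "train_bottle",
        shape=(n_examples,) + feature_shape,
        dtype=np.float32,
        chunks=(batch_size,) + feature_shape,  # one chunk per batch along axis 0
    )
    labels = f.create_dataset(
        "train_labels",
        shape=(n_examples, n_labels),
        dtype=np.uint8,
        chunks=(batch_size, n_labels),
    )
    for i in range(n_batches):
        start, stop = i * batch_size, (i + 1) * batch_size
        # Random data stands in for model.predict(x) and the label batch y.
        bottle[start:stop] = np.random.rand(batch_size, *feature_shape).astype(np.float32)
        labels[start:stop] = np.random.randint(0, 2, size=(batch_size, n_labels), dtype=np.uint8)

# Reopen read-only to confirm what was written.
with h5py.File(path, "r") as f:
    written_shape = f["train_bottle"].shape
    written_chunks = f["train_bottle"].chunks
    print(written_shape, written_chunks)
```

Because each batch maps onto whole chunks, every write replaces complete chunks and no partially filled chunk has to be read back from disk.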

For determining the right chunk-size I would also recommend:

Answer 2

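If you have enough DDR memory and want extremely fast data loading and saving performance, use np.load() and np.save() directly. np.load() and np.save() gave me the fastest data loading and saving performance; so far I could not find any other tool or framework that competes with them, and even HDF5's performance is only 1/5 ~ 1/7 of theirs.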
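A minimal sketch of that approach, assuming one .npy file per batch (np.save stores a single array per file) and a small random array standing in for real bottleneck features; the file name batch_000.npy is hypothetical, and the timings it prints depend entirely on the machine.

```python
import os
import tempfile
import time

import numpy as np

# Small stand-in for one batch of bottleneck features.
batch = np.random.rand(12, 10, 10, 64).astype(np.float32)
path = os.path.join(tempfile.mkdtemp(), "batch_000.npy")

t0 = time.perf_counter()
np.save(path, batch)  # writes a small header followed by the raw array bytes
save_seconds = time.perf_counter() - t0

t0 = time.perf_counter()
loaded = np.load(path)
load_seconds = time.perf_counter() - t0

size_mb = batch.nbytes / 1e6
print(f"saved {size_mb:.2f} MB in {save_seconds:.4f} s, loaded in {load_seconds:.4f} s")
```

Note that this trades HDF5's single-file, sliceable dataset for a collection of files, so loading a subset means picking the right files rather than slicing one dataset.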

Answer 3

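This answer is more like a comment on the argument between @max9111 and @Clock ZHONG. I wrote this to help other people wondering which is faster, HDF5 or np.save().

I used the code provided by @max9111 and modified it as suggested by @Clock ZHONG. The exact jupyter notebook can be found at https://github.com/wornbb/save_speed_test.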

In short, with my specs:

SSD: Samsung 960 EVO
CPU: i7-7700K
RAM: 2133 MHz 16GB
OS: Win 10

HDF5 achieves 1339.5 MB/s while np.save is only 924.9 MB/s (without compression).

Also, as noted by @Clock ZHONG, he/she had a problem with the lzf filter. If you also have this problem, the posted jupyter notebook can be run with the conda distribution of python3 with pip-installed packages on Win 10.
