Size on disk of a partly filled HDF5 dataset

Question

I'm reading the book Python and HDF5 (O'Reilly), which has a section on empty datasets and the size they take up on disk:

import numpy as np
import h5py

f = h5py.File("testfile.hdf5")
dset = f.create_dataset("big dataset", (1024**3,), dtype=np.float32)
f.flush()
# Size on disk is 1KB

dset[0:1024] = np.arange(1024)
f.flush()
# Size on disk is 4GB

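After filling part of the dataset (the first 1024 entries) with values, I expected the file to grow, but not to 4GB. That is essentially the same size as when I do: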

dset[...] = np.arange(1024**3)

The book states that the file size on disk should be around 66KB. Can anyone explain the reason for the sudden increase in size?

Version info:

Python 3.6.1 (OSX)
h5py 2.7.0

Answer 1

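If you open the file in HDFView, you can see that chunking is off. This means the array is stored as one contiguous block in the file and cannot be resized. All 4 GB therefore have to be allocated in the file once any part of the dataset is written.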
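
If HDFView is not at hand, the same check can be made from h5py itself. The following is a minimal sketch, assuming a scratch file in a temporary directory and a shape scaled down from the question's so that it runs quickly; the file name and the second dataset's name are made up for illustration. The `chunks` property of a dataset is `None` when the layout is contiguous.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file; the shape is scaled down from the question's 1024**3.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "layout_check.hdf5")
    with h5py.File(path, "w") as f:
        # Created the same way as in the question: no chunks argument.
        dset = f.create_dataset("big dataset", (1024**2,), dtype=np.float32)
        layout = dset.chunks  # None means contiguous storage

        # For comparison, a dataset created with chunking enabled.
        chunked = f.create_dataset("chunked dataset", (1024**2,),
                                   dtype=np.float32, chunks=True)
        chunked_layout = chunked.chunks  # a tuple giving the chunk shape

print("default layout:", layout)
print("chunked layout:", chunked_layout)
```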

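If you create the dataset with chunking enabled, it is split into regularly sized chunks that are stored at arbitrary locations on disk and indexed with a B-tree. In that case, only chunks that contain data (at least one element) are allocated on disk. If you create the dataset as follows, the file will be much smaller: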

dset = f.create_dataset("big dataset", (1024**3,), dtype=np.float32, chunks=True)

chunks=True lets h5py determine the chunk size automatically. You can also set the chunk size explicitly. For example, to set it to 16384 floats (= 64 KiB), use:

dset = f.create_dataset("big dataset", (1024**3,), dtype=np.float32, chunks=(2**14,))
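
The arithmetic behind that chunk size can be sketched in plain Python. The variable names are illustrative only, and the 4-byte item size is the size of one np.float32 element. With this layout, writing the first 1024 entries touches a single 64 KiB chunk, which is in the same range as the roughly 66KB mentioned in the question (the remainder would be file metadata).

```python
ITEMSIZE = 4             # bytes per np.float32 element
n_elements = 1024**3     # dataset length from the question
chunk_elements = 2**14   # chunk length from chunks=(2**14,)

chunk_bytes = chunk_elements * ITEMSIZE        # bytes occupied by one chunk
n_chunks = n_elements // chunk_elements        # chunks in a fully written dataset
chunks_touched = -(-1024 // chunk_elements)    # ceiling division: chunks hit by dset[0:1024]

print(chunk_bytes // 1024, "KiB per chunk")
print(n_chunks, "chunks in total,", chunks_touched, "touched by the partial write")
```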

The best chunk size depends on the read and write patterns of your application. Note that:

Chunking has performance implications. It’s recommended to keep the total size of your chunks between 10 KiB and 1 MiB, larger for larger datasets. Also keep in mind that when any element in a chunk is accessed, the entire chunk is read from disk.

See http://docs.h5py.org/en/latest/high/dataset.html#chunked-storage
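
To see the difference in file size without writing 4 GB, the experiment can be repeated on a smaller dataset. This is a rough sketch rather than a benchmark: the helper `size_after_partial_write`, the file names, and the reduced length `N` are all assumptions made for illustration, and the exact byte counts will vary with the HDF5 version.

```python
import os
import tempfile

import h5py
import numpy as np

N = 1024**2  # 4 MiB of float32, scaled down from the question's 1024**3


def size_after_partial_write(path, **kwargs):
    """Create a dataset, fill its first 1024 entries, return the file size in bytes."""
    with h5py.File(path, "w") as f:
        dset = f.create_dataset("big dataset", (N,), dtype=np.float32, **kwargs)
        dset[0:1024] = np.arange(1024)
    # The file is closed here, so the size on disk is final.
    return os.path.getsize(path)


with tempfile.TemporaryDirectory() as tmp:
    contiguous_size = size_after_partial_write(os.path.join(tmp, "contiguous.hdf5"))
    chunked_size = size_after_partial_write(os.path.join(tmp, "chunked.hdf5"),
                                            chunks=(2**14,))

print("contiguous:", contiguous_size, "bytes")
print("chunked:   ", chunked_size, "bytes")
```

The contiguous file should come out at least as large as the full dataset (4 MiB here), while the chunked file only needs room for the one chunk that was written plus metadata.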

python-3.x

h5py