Compressed files bigger in h5py

I'm using h5py to save numpy arrays in HDF5 format from Python. Recently I tried to apply compression, and the files I get are bigger...

I went from things like this (every file has several datasets)

self._h5_current_frame.create_dataset(
        'estimated position', shape=estimated_pos.shape, 
         dtype=float, data=estimated_pos)

to things like this

self._h5_current_frame.create_dataset(
        'estimated position', shape=estimated_pos.shape, dtype=float,
        data=estimated_pos, compression="gzip", compression_opts=9)

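In one particular example, the compressed file is 172K and the uncompressed one is 72K (and h5diff reports that the two files are equal). I tried a more basic example and it works as expected... but not in my program.

How is that possible? I don't think the gzip algorithm ever produces a bigger compressed file, so it's probably related to h5py and how I'm using it :-/ Any ideas?

Cheers!!

EDIT:

Judging from the output of h5stat, the compressed version seems to store a lot of metadata (see the last few lines of the output).

Compressed file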

Filename: res_totolaca_jue_2015-10-08_17:06:30_19387.hdf5
File information
    # of unique groups: 21
    # of unique datasets: 56
    # of unique named datatypes: 0
    # of unique links: 0
    # of unique other: 0
    Max. # of links to object: 1
    Max. # of objects in group: 5
File space information for file metadata (in bytes):
    Superblock extension: 0
    User block: 0
    Object headers: (total/unused)
        Groups: 3798/503
        Datasets(exclude compact data): 15904/9254
        Datatypes: 0/0
    Groups:
        B-tree/List: 0
        Heap: 0
    Attributes:
        B-tree/List: 0
        Heap: 0
    Chunked datasets:
        Index: 116824
    Datasets:
        Heap: 0
    Shared Messages:
        Header: 0
        B-tree/List: 0
        Heap: 0
Small groups (with 0 to 9 links):
    # of groups with 1 link(s): 1
    # of groups with 2 link(s): 5
    # of groups with 3 link(s): 5
    # of groups with 5 link(s): 10
    Total # of small groups: 21
Group bins:
    # of groups with 1 - 9 links: 21
    Total # of groups: 21
Dataset dimension information:
    Max. rank of datasets: 3
    Dataset ranks:
        # of dataset with rank 1: 51
        # of dataset with rank 2: 3
        # of dataset with rank 3: 2
1-D Dataset information:
    Max. dimension size of 1-D datasets: 624
    Small 1-D datasets (with dimension sizes 0 to 9):
        # of datasets with dimension sizes 1: 36
        # of datasets with dimension sizes 2: 2
        # of datasets with dimension sizes 3: 2
        Total # of small datasets: 40
    1-D Dataset dimension bins:
        # of datasets with dimension size 1 - 9: 40
        # of datasets with dimension size 10 - 99: 2
        # of datasets with dimension size 100 - 999: 9
        Total # of datasets: 51
Dataset storage information:
    Total raw data size: 33602
    Total external raw data size: 0
Dataset layout information:
    Dataset layout counts[COMPACT]: 0
    Dataset layout counts[CONTIG]: 2
    Dataset layout counts[CHUNKED]: 54
    Number of external files : 0
Dataset filters information:
    Number of datasets with:
        NO filter: 2
        GZIP filter: 54
        SHUFFLE filter: 0
        FLETCHER32 filter: 0
        SZIP filter: 0
        NBIT filter: 0
        SCALEOFFSET filter: 0
        USER-DEFINED filter: 0
Dataset datatype information:
    # of unique datatypes used by datasets: 4
    Dataset datatype #0:
        Count (total/named) = (20/0)
        Size (desc./elmt) = (14/8)
    Dataset datatype #1:
        Count (total/named) = (17/0)
        Size (desc./elmt) = (22/8)
    Dataset datatype #2:
        Count (total/named) = (10/0)
        Size (desc./elmt) = (22/8)
    Dataset datatype #3:
        Count (total/named) = (9/0)
        Size (desc./elmt) = (14/8)
    Total dataset datatype count: 56
Small # of attributes (objects with 1 to 10 attributes):
    Total # of objects with small # of attributes: 0
Attribute bins:
    Total # of objects with attributes: 0
    Max. # of attributes to objects: 0
Summary of file space information:
  File metadata: 136526 bytes
  Raw data: 33602 bytes
  Unaccounted space: 5111 bytes
Total space: 175239 bytes

Uncompressed file

Filename: res_totolaca_jue_2015-10-08_17:03:04_19267.hdf5
File information
    # of unique groups: 21
    # of unique datasets: 56
    # of unique named datatypes: 0
    # of unique links: 0
    # of unique other: 0
    Max. # of links to object: 1
    Max. # of objects in group: 5
File space information for file metadata (in bytes):
    Superblock extension: 0
    User block: 0
    Object headers: (total/unused)
        Groups: 3663/452
        Datasets(exclude compact data): 15904/10200
        Datatypes: 0/0
    Groups:
        B-tree/List: 0
        Heap: 0
    Attributes:
        B-tree/List: 0
        Heap: 0
    Chunked datasets:
        Index: 0
    Datasets:
        Heap: 0
    Shared Messages:
        Header: 0
        B-tree/List: 0
        Heap: 0
Small groups (with 0 to 9 links):
    # of groups with 1 link(s): 1
    # of groups with 2 link(s): 5
    # of groups with 3 link(s): 5
    # of groups with 5 link(s): 10
    Total # of small groups: 21
Group bins:
    # of groups with 1 - 9 links: 21
    Total # of groups: 21
Dataset dimension information:
    Max. rank of datasets: 3
    Dataset ranks:
        # of dataset with rank 1: 51
        # of dataset with rank 2: 3
        # of dataset with rank 3: 2
1-D Dataset information:
    Max. dimension size of 1-D datasets: 624
    Small 1-D datasets (with dimension sizes 0 to 9):
        # of datasets with dimension sizes 1: 36
        # of datasets with dimension sizes 2: 2
        # of datasets with dimension sizes 3: 2
        Total # of small datasets: 40
    1-D Dataset dimension bins:
        # of datasets with dimension size 1 - 9: 40
        # of datasets with dimension size 10 - 99: 2
        # of datasets with dimension size 100 - 999: 9
        Total # of datasets: 51
Dataset storage information:
    Total raw data size: 50600
    Total external raw data size: 0
Dataset layout information:
    Dataset layout counts[COMPACT]: 0
    Dataset layout counts[CONTIG]: 56
    Dataset layout counts[CHUNKED]: 0
    Number of external files : 0
Dataset filters information:
    Number of datasets with:
        NO filter: 56
        GZIP filter: 0
        SHUFFLE filter: 0
        FLETCHER32 filter: 0
        SZIP filter: 0
        NBIT filter: 0
        SCALEOFFSET filter: 0
        USER-DEFINED filter: 0
Dataset datatype information:
    # of unique datatypes used by datasets: 4
    Dataset datatype #0:
        Count (total/named) = (20/0)
        Size (desc./elmt) = (14/8)
    Dataset datatype #1:
        Count (total/named) = (17/0)
        Size (desc./elmt) = (22/8)
    Dataset datatype #2:
        Count (total/named) = (10/0)
        Size (desc./elmt) = (22/8)
    Dataset datatype #3:
        Count (total/named) = (9/0)
        Size (desc./elmt) = (14/8)
    Total dataset datatype count: 56
Small # of attributes (objects with 1 to 10 attributes):
    Total # of objects with small # of attributes: 0
Attribute bins:
    Total # of objects with attributes: 0
    Max. # of attributes to objects: 0
Summary of file space information:
  File metadata: 19567 bytes
  Raw data: 50600 bytes
  Unaccounted space: 5057 bytes
Total space: 75224 bytes

First off, here's a reproducible example:

import h5py
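# NOTE: scipy.misc.lena() was removed from later SciPy releases; any other
# compressible 512x512 array works here, but the byte counts below will differ.
from scipy.misc import lena

img = lena()    # some compressible image data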

f1 = h5py.File('nocomp.h5', 'w')
f1.create_dataset('img', data=img)
f1.close()

f2 = h5py.File('complevel_9.h5', 'w')
f2.create_dataset('img', data=img, compression='gzip', compression_opts=9)
f2.close()

f3 = h5py.File('complevel_0.h5', 'w')
f3.create_dataset('img', data=img, compression='gzip', compression_opts=0)
f3.close()

Now let's take a look at the file sizes:

~$ h5stat -S nocomp.h5
Filename: nocomp.h5
Summary of file space information:
  File metadata: 1304 bytes
  Raw data: 2097152 bytes
  Unaccounted space: 840 bytes
Total space: 2099296 bytes

~$ h5stat -S complevel_9.h5
Filename: complevel_9.h5
Summary of file space information:
  File metadata: 11768 bytes
  Raw data: 302850 bytes
  Unaccounted space: 1816 bytes
Total space: 316434 bytes

~$ h5stat -S complevel_0.h5
Filename: complevel_0.h5
Summary of file space information:
  File metadata: 11768 bytes
  Raw data: 2098560 bytes
  Unaccounted space: 1816 bytes
Total space: 2112144 bytes

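In my example, compressing with gzip -9 makes sense: although it requires an extra ~10kB of metadata, this is far outweighed by a ~1794kB decrease in the size of the image data (about a 7:1 compression ratio). The net result is a ~6.6-fold reduction in total file size.

However, in your example the compression only reduces the size of your raw data by about 17kB (from 50600 to 33602 bytes, a compression ratio of about 1.5:1), which is vastly outweighed by a ~117kB increase in the size of the metadata (from 19567 to 136526 bytes). The reason the increase in metadata size is so much larger than in my example is probably that your file contains 56 datasets rather than just one.

Even if gzip magically reduced the size of your raw data to zero, you would still end up with a file that was ~1.8 times larger than the uncompressed version. The size of the metadata is more or less guaranteed to scale sublinearly with the size of your arrays, so if your datasets were much larger you would start to see some benefit from compressing them. As it stands, your arrays are so small that you're unlikely to gain anything from compression.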
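To make that arithmetic easy to check, here is a minimal sketch that redoes it from the byte counts in the two h5stat summaries in the question. The dictionary and variable names are my own; only the numbers come from the h5stat output.

```python
# Byte counts copied from the "Summary of file space information" sections
# of the two h5stat outputs in the question.
uncompressed = {"metadata": 19567, "raw": 50600, "unaccounted": 5057}
compressed = {"metadata": 136526, "raw": 33602, "unaccounted": 5111}

raw_saved = uncompressed["raw"] - compressed["raw"]
metadata_added = compressed["metadata"] - uncompressed["metadata"]
total_uncompressed = sum(uncompressed.values())
total_compressed = sum(compressed.values())

# Hypothetical best case: gzip shrinks the raw data to zero bytes.
best_case = total_compressed - compressed["raw"]

print(raw_saved)                                  # 16998
print(metadata_added)                             # 116959
print(round(best_case / total_uncompressed, 2))   # 1.88
```

The metadata added is roughly seven times the raw data saved, so no compression level could make the compressed file come out ahead with this layout.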


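Update:

The reason the compressed version requires so much more metadata isn't really to do with the compression per se. Rather, in order to use compression filters, the dataset needs to be split into fixed-size chunks. Presumably a lot of the extra metadata is being used to store the B-tree that is needed to index the chunks. This is consistent with your h5stat output, where the "Chunked datasets: Index" entry is 116824 bytes for the compressed file and 0 for the uncompressed one.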

f4 = h5py.File('nocomp_autochunked.h5', 'w')
# let h5py pick a chunk size automatically
f4.create_dataset('img', data=img, chunks=True)
print(f4['img'].chunks)
# (32, 64)
f4.close()

f5 = h5py.File('nocomp_onechunk.h5', 'w')
# make the chunk shape the same as the shape of the array, so that there 
# is only one chunk
f5.create_dataset('img', data=img, chunks=img.shape)
print(f5['img'].chunks)
# (512, 512)
f5.close()

f6 = h5py.File('complevel_9_onechunk.h5', 'w')
f6.create_dataset('img', data=img, chunks=img.shape, compression='gzip',
                  compression_opts=9)
f6.close()

The resulting file sizes:

~$ h5stat -S nocomp_autochunked.h5
Filename: nocomp_autochunked.h5
Summary of file space information:
  File metadata: 11768 bytes
  Raw data: 2097152 bytes
  Unaccounted space: 1816 bytes
Total space: 2110736 bytes

~$ h5stat -S nocomp_onechunk.h5
Filename: nocomp_onechunk.h5
Summary of file space information:
  File metadata: 3920 bytes
  Raw data: 2097152 bytes
  Unaccounted space: 96 bytes
Total space: 2101168 bytes

~$ h5stat -S complevel_9_onechunk.h5
Filename: complevel_9_onechunk.h5
Summary of file space information:
  File metadata: 3920 bytes
  Raw data: 305051 bytes
  Unaccounted space: 96 bytes
Total space: 309067 bytes

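It's clear that chunking, rather than compression, is what incurs the extra metadata: nocomp_autochunked.h5 contains exactly the same amount of metadata as complevel_0.h5 above, and introducing compression to the chunked version in complevel_9_onechunk.h5 made no difference to the total amount of metadata.

In this example, increasing the chunk size so that the array is stored as a single chunk reduced the amount of metadata by a factor of about 3 (from 11768 to 3920 bytes). How much difference this would make in your case will probably depend on how h5py automatically selects a chunk size for your input datasets. Interestingly, this also resulted in a very slight reduction in the compression ratio (the raw data grew from 302850 to 305051 bytes), which is not what I would have predicted.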

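Bear in mind that there are also disadvantages to having larger chunks. Whenever you want to access a single element within a chunk, the whole chunk needs to be decompressed and read into memory. For a large dataset this can be disastrous for performance, but in your case the arrays are so small that it's probably not worth worrying about.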
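The trade-off above can be put in rough numbers with a small sketch. The helper chunk_stats is hypothetical (it is not part of h5py), and the 8-byte item size is inferred from the raw data size of 2097152 bytes for a 512x512 array rather than stated anywhere in the output.

```python
import math

def chunk_stats(shape, chunks, itemsize):
    """Return (number of chunks, uncompressed bytes per chunk)."""
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    chunk_bytes = math.prod(chunks) * itemsize
    return n_chunks, chunk_bytes

# Assumption: 8-byte elements, since 512 * 512 * 8 == 2097152.
shape, itemsize = (512, 512), 8

auto = chunk_stats(shape, (32, 64), itemsize)       # chunk shape h5py picked above
single = chunk_stats(shape, (512, 512), itemsize)   # one chunk for the whole array

print(auto)    # (128, 16384)
print(single)  # (1, 2097152)
```

With the automatic chunk shape there are 128 chunks to index, but reading one element only touches a 16kB chunk. With a single chunk there is just one entry to index, but any read has to pull in and decompress the full 2MB.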

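Another thing you should consider is whether you can store your datasets in a single array rather than lots of small arrays. For example, if you have K 2D arrays of the same dtype that each have dimensions MxN, then you could store them more efficiently in a single KxMxN 3D array rather than in lots of small datasets. I don't know enough about your data to know whether this is feasible.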
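As a rough illustration of that last suggestion, here is a minimal sketch that stacks K small arrays into one KxMxN dataset. The sizes, the dataset name 'frames' and the file name are made up for the example, and the file is kept in memory with the core driver so nothing is written to disk.

```python
import h5py
import numpy as np

K, M, N = 40, 3, 4                     # hypothetical sizes
rng = np.random.default_rng(0)
arrays = [rng.random((M, N)) for _ in range(K)]

# Combine the K separate MxN arrays into a single (K, M, N) array.
stacked = np.stack(arrays)

# driver='core' with backing_store=False keeps the file purely in memory.
with h5py.File('stacked_demo.h5', 'w', driver='core', backing_store=False) as f:
    dset = f.create_dataset('frames', data=stacked,
                            compression='gzip', compression_opts=9)
    stored_shape = dset.shape
    first = dset[0]                    # read back the first MxN array

print(stored_shape)  # (40, 3, 4)
```

This way there is one object header and one chunk index for the whole stack instead of K of each. Whether that actually shrinks your files is something you would need to check with h5stat on your own data.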