File Size Reduction in Pandas and HDF5

Question

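I'm running a model that outputs data into multiple Pandas frames and then saves those frames to an HDF5 file. The model is run hundreds of times, and each run adds new (multi-indexed) columns to the frames in the existing HDF5 file. This is done with Pandas merge. Because the frames have a different length for each run, the final frames end up containing a lot of NaN values.

After enough model runs are done, data are removed from the frames if the row or column is associated with a model run that errored. In that process, the new data frames are put into a new HDF5 file. The following pseudo-python shows the process: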
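To illustrate how merging runs of different lengths produces NaN padding, here is a minimal sketch. The run names, the `variable` labels, the row counts, and the helper `make_run_frame` are all hypothetical; the question does not show the actual merge call, so the outer merge on the row index is an assumption.

```python
import numpy as np
import pandas as pd


def make_run_frame(run_id, n_rows):
    """Build a frame for one hypothetical model run with (run, variable) columns."""
    columns = pd.MultiIndex.from_product(
        [[run_id], ["x", "y"]], names=["run", "variable"]
    )
    data = np.arange(n_rows * 2, dtype="float64").reshape(n_rows, 2)
    return pd.DataFrame(data, columns=columns)


run_a = make_run_frame("run_a", n_rows=3)
run_b = make_run_frame("run_b", n_rows=5)

# An outer merge on the row index keeps every row from both runs,
# so the shorter run is padded with NaN in the rows it does not have.
combined = pd.merge(run_a, run_b, left_index=True, right_index=True, how="outer")

nan_count = int(combined.isna().sum().sum())
print(combined)
print("NaN cells:", nan_count)
```

With hundreds of runs of varying length, this padding accumulates in every frame.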

import pandas

with pandas.HDFStore(filename) as store:
    # figure out which indices should be removed
    indices_to_drop = get_bad_indices(store)

    # write the reduced frames to a fresh file; the context manager closes it
    with pandas.HDFStore(reduced_filename) as new_store:
        for key in store.keys():
            df = store[key]
            for idx in indices_to_drop:
                df = df.drop(idx, <level and axis info>)
            new_store[key] = df
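A runnable variant of the pseudo-python above might look like the following sketch. It assumes the failed runs are identified by labels in a column level named `run`, which is a hypothetical layout; the question leaves the level and axis unspecified, and dropping rows would need the `index=` argument instead. The function names and the demo frame are illustrative only.

```python
import numpy as np
import pandas as pd


def drop_bad_runs(df, bad_runs, level="run"):
    """Return a copy of df without the columns that belong to failed runs."""
    # errors="ignore" skips labels that are absent from this particular frame.
    return df.drop(columns=bad_runs, level=level, errors="ignore")


def copy_without_bad_runs(filename, reduced_filename, bad_runs):
    """Write every frame in filename to reduced_filename, minus the bad runs."""
    with pd.HDFStore(filename, mode="r") as store, \
            pd.HDFStore(reduced_filename, mode="w") as new_store:
        for key in store.keys():
            new_store[key] = drop_bad_runs(store[key], bad_runs)


# Small in-memory demonstration of the drop step (no HDF5 file needed).
columns = pd.MultiIndex.from_product(
    [["run_a", "run_b", "run_c"], ["x", "y"]], names=["run", "variable"]
)
demo = pd.DataFrame(np.zeros((2, 6)), columns=columns)
reduced = drop_bad_runs(demo, ["run_b", "run_missing"])
remaining_runs = list(reduced.columns.get_level_values("run").unique())
print(remaining_runs)
```

Passing all bad labels in one `drop` call avoids the inner loop over `indices_to_drop`.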

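The new hdf5 file ends up being about 10% of the size of the original. The only difference between the files is that all the NaN values are no longer equal (but all are numpy float64 values).

My question is: how can I achieve this file size reduction on the existing hdf5 file (presumably by managing the NaN values)? Sometimes I don't need to run the procedure above, but I do it anyway to get the reduction. Is there an existing Pandas or PyTables command that can do this? Thank you very much in advance.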

Answer 1

See the pandas documentation on HDF5 (PyTables).

The warning says it all:

Warning: Please note that HDF5 DOES NOT RECLAIM SPACE in the h5 files automatically. Thus, repeatedly deleting (or removing nodes) and adding again WILL TEND TO INCREASE THE FILE SIZE. To clean the file, use ptrepack.
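As a sketch of how `ptrepack` could be driven from Python, the snippet below assembles the command line and runs it with `subprocess`. The file names `in.h5` and `out.h5` and the helper names are hypothetical, and the compression settings are optional choices rather than requirements; it also assumes the `ptrepack` script that ships with PyTables is on the PATH. Since `ptrepack` copies the live nodes into a fresh file, this is the same effect the question gets by rewriting every frame into a new store, which suggests the saving comes from dropping unreclaimed space rather than from the NaN values.

```python
import subprocess


def build_ptrepack_command(src, dst, complevel=9, complib="blosc"):
    """Assemble the ptrepack invocation as an argument list."""
    return [
        "ptrepack",
        "--chunkshape=auto",   # let PyTables pick chunk shapes for the copy
        "--propindexes",       # carry existing indexes over to the new file
        f"--complevel={complevel}",
        f"--complib={complib}",
        src,
        dst,
    ]


def repack(src, dst, **options):
    """Copy src into a fresh file dst; src itself is left untouched."""
    subprocess.run(build_ptrepack_command(src, dst, **options), check=True)


command = build_ptrepack_command("in.h5", "out.h5")
print(" ".join(command))
```

Calling `repack("in.h5", "out.h5")` writes the compacted copy; replacing the original file afterwards is a separate rename step.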
