Using blosc compression in pandas causes heap corruption

Question

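I have been using Pandas for a while, but I am new to HDF5, so I am trying to learn it and convert some of my research data files to HDF5. I went through a bunch of SO posts on Python and HDF5, and I am interested in using the BLOSC compression algorithm (we do a lot of computation on the datasets, so read/write speed takes priority over storage size).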

I ran into a problem with the blosc compression library when using pandas.to_hdf. When I use blosc, Python crashes, and when I open the debugger in Visual Studio 2010 I get

Unhandled exception at 0x00007ffcd59fa28c in python.exe: 0xC0000374: A heap has been corrupted.

I set up a standalone example in a script and hit the same problem:

import numpy as np
import pandas as pd

test = pd.DataFrame()
test['random1'] = np.random.randn(1000000)
test['random2'] = np.random.randn(1000000)
test['random3'] = np.random.randn(1000000)

# Write out a csv first to compare file sizes
test.to_csv('./examples/data/random_3c.csv')

# Write out using different compression algorithms to compare
test.to_hdf('./examples/data/random_3c_zlib.h5',
            key='Random_3Col', mode='w', format='table', 
            append=False, complevel=9, complib='zlib', fletcher32=True)

test.to_hdf('./examples/data/random_3c_blosc.h5',
            key='Random_3Col', mode='w', format='table', 
            append=False, complevel=9, complib='blosc', fletcher32=True)

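The csv writes fine (file size is 65,217 KB)
The zlib-compressed file writes fine (file size is 21,719 KB)
The blosc compression crashes the kernel, and I get the heap corruption message when I open the debugger in VS

My pandas version is 0.16.2
My PyTables version is 3.2.0
I also installed hdf5 from hdfgroup
I am working on a Windows machine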

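At this point I am not even sure how to start tracking down what is causing the crash. Any suggestions, or has anyone seen this before? I found a few SO questions from people having trouble using an external blosc library, but I have not gotten that far yet. I figured I would get the basics working first! As far as I know, pandas uses PyTables, which comes bundled with a version of blosc.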
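One possible way to start narrowing down a crash like this is to run each write in a separate Python process, so that a hard crash in one compressor does not take down the session that is doing the investigating. The sketch below is only an illustration under some assumptions: the helper name `try_complibs`, the child script, and the small 1000-row frame are hypothetical, and `fletcher32` is left out because newer pandas versions may not accept it in `to_hdf`.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

# Script run in a child interpreter: writes one small HDF5 file
# with the compression library given on the command line.
CHILD = textwrap.dedent("""
    import sys
    import numpy as np
    import pandas as pd

    path, complib = sys.argv[1], sys.argv[2]
    df = pd.DataFrame({"random1": np.random.randn(1000)})
    df.to_hdf(path, key="Random_3Col", mode="w", format="table",
              complevel=9, complib=complib)
""")


def try_complibs(complibs=("zlib", "blosc"), timeout=120):
    """Return {complib: (exit code, stderr text)} for each attempted write.

    An exit code of 0 means the child finished the write. A nonzero code
    with a Python traceback in stderr points to an ordinary exception;
    a nonzero code without one suggests the interpreter itself died.
    """
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for complib in complibs:
            path = str(Path(tmp) / f"test_{complib}.h5")
            proc = subprocess.run(
                [sys.executable, "-c", CHILD, path, complib],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            results[complib] = (proc.returncode, proc.stderr)
    return results
```

Calling `try_complibs()` and printing the result shows which compressors complete and which do not. If zlib returns 0 while blosc returns a nonzero code with no traceback, that points at the compiled blosc code rather than at the script.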

Thanks!

Answer 1

If you are using the Anaconda distribution, this is a package build issue: Pytables 3.2, python 3.4 under windows x64 · Issue #458 · ContinuumIO/anaconda-issues. You can watch that issue and wait for a fix.
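To check whether an environment matches the combination named in that issue, it may help to collect the interpreter and package versions in one place. This is a minimal sketch using only the standard library; the function name `environment_report` is hypothetical, and it assumes PyTables is registered under the distribution name `tables`.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("pandas", "tables", "numpy")):
    """Return a dict with Python, platform, and package versions."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        # sys.maxsize exceeds 2**32 only on 64-bit builds of Python
        "bits": "64" if sys.maxsize > 2**32 else "32",
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment
            report[name] = None
    return report
```

PyTables also provides `tables.print_versions()`, which additionally lists the versions of the compression libraries it was built with.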

