Compression option in fastparquet is not consistent

According to the project page of fastparquet, fastparquet supports various compression methods:

Optional (compression algorithms; gzip is always available):

snappy (aka python-snappy)
lzo
brotli
lz4
zstandard

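In particular, zstandard is a modern algorithm that offers a high compression ratio together with impressively fast compression/decompression speeds. That is what I want to use with fastparquet.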

But the documentation of fastparquet.write says:

compression to apply to each column, e.g. GZIP or SNAPPY or a dict like {"col1": "SNAPPY", "col2": None} to specify per column compression types. In both cases, the compressor settings would be the underlying compressor defaults. To pass arguments to the underlying compressor, each dict entry should itself be a dictionary:

{
    "col1": {
        "type": "LZ4",
        "args": {
            "compression_level": 6,
            "content_checksum": True
        }
    },
    "col2": {
        "type": "SNAPPY",
        "args": None
    },
    "_default": {
        "type": "GZIP",
        "args": None
    }
}
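To make the nested layout quoted above easier to see, here is a minimal sketch that builds a dict of that shape in plain Python. The helper name per_column_compression and its arguments are hypothetical and are not part of fastparquet; only the "type", "args" and "_default" keys come from the quoted documentation.

```python
def per_column_compression(default="GZIP", default_args=None, **columns):
    """Build a compression spec in the nested shape quoted from the docs.

    Hypothetical helper, not a fastparquet function. Each keyword maps a
    column name to either a codec name or a (codec name, args dict) tuple.
    """
    # "_default" covers every column that is not listed explicitly.
    spec = {"_default": {"type": default, "args": default_args}}
    for column, choice in columns.items():
        if isinstance(choice, tuple):
            codec, args = choice
        else:
            codec, args = choice, None
        spec[column] = {"type": codec, "args": args}
    return spec


spec = per_column_compression(
    default="GZIP",
    col1=("LZ4", {"compression_level": 6, "content_checksum": True}),
    col2="SNAPPY",
)
```

The resulting spec would then be passed as the compression argument, for example fastparquet.write('outfile.parq', df, compression=spec), provided the named codecs are actually available in the environment.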

There is no mention of zstandard. Worse, if I write

fastparquet.write('outfile.parq', df, compression='LZ4')

an error pops up saying

Compression 'LZ4' not available. Options: ['GZIP', 'UNCOMPRESSED']

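So fastparquet only supports 'GZIP'? That is a big discrepancy from the project page! Am I missing some packages? How can I use fastparquet with all the compression algorithms listed on the project page?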

Yes, you are probably missing some packages. Your system must first have the Python LZ4 and/or zstandard bindings installed. For details, see the source code.

  • For LZ4: if import lz4.block raises ModuleNotFoundError, install it with pip install lz4

  • Similarly for zstandard: pip install zstandard

  • For brotli: pip install brotlipy

  • For lzo: pip install python-lzo

  • And for snappy: pip install python-snappy
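As a quick way to see which of these bindings are present before calling fastparquet.write, the following sketch checks whether each module can be found, using only the standard library. The mapping from codec name to module name is an assumption based on the list above (only lz4.block is named explicitly there), and the helper names are hypothetical.

```python
import importlib.util

# Assumed mapping from codec name to the Python module that provides it.
CANDIDATES = {
    "SNAPPY": "snappy",
    "LZO": "lzo",
    "BROTLI": "brotli",
    "LZ4": "lz4.block",
    "ZSTD": "zstandard",
}


def is_importable(module_name):
    """Return True if the module can be located without fully importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec raises when the parent package of a dotted name is missing.
        return False


def available_bindings(candidates=CANDIDATES):
    """Map each codec name to whether its binding module was found."""
    return {codec: is_importable(module) for codec, module in candidates.items()}


if __name__ == "__main__":
    for codec, found in available_bindings().items():
        print(f"{codec}: {'installed' if found else 'missing'}")
```

Any codec reported as missing is a candidate for the matching pip install command listed above. The codec names used as keys here (for example 'ZSTD') are assumed labels for display and may not match exactly what a given fastparquet version accepts.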