Python decompression relative performance?

Question

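TL;DR: of the various compression algorithms available in Python (gzip, bz2, lzma, etc.), which has the best decompression performance?

Full discussion:

Python 3 has various modules for compressing/decompressing data, including gzip, bz2 and lzma. gzip and bz2 additionally let you set different compression levels.

If my goal is to balance file size (compression ratio) against decompression speed (compression speed is not a concern), which is the best choice? Decompression speed matters more than file size, but since the uncompressed files in question are around 600-800 MB each (32-bit RGB .png image files) and I have a dozen of them, I do want some compression.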
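As a minimal sketch of how those levels are set, the snippet below compresses the same buffer with each standard-library codec. The payload is a hypothetical stand-in for image data, and `preset=6` is simply lzma's documented default, not a recommendation.

```python
import bz2
import gzip
import lzma

# Hypothetical payload: repetitive bytes standing in for low-entropy image data.
payload = bytes(range(256)) * 4096  # 1 MiB

candidates = {
    "gzip level 1": gzip.compress(payload, compresslevel=1),
    "gzip level 9": gzip.compress(payload, compresslevel=9),
    "bz2 level 1": bz2.compress(payload, compresslevel=1),
    "bz2 level 9": bz2.compress(payload, compresslevel=9),
    "lzma preset 6": lzma.compress(payload, preset=6),
}

for name, blob in candidates.items():
    print(f"{name:14s} {len(blob):8d} bytes  ratio {len(blob) / len(payload):.4f}")
```

The ratios printed for this synthetic buffer say nothing about real images; swap in the actual pickled bytes to get meaningful numbers.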

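My use case is that I load a dozen images from disk, do some processing on them (as a numpy array), and then use the processed array data in my program.
- The images never change; I just have to load them each time I run my program.
- The processing takes about the same length of time as the loading (several seconds), so I am trying to save some loading time by saving the processed data (using pickle) rather than loading the raw, unprocessed images every time. Initial tests were promising: loading the raw/uncompressed pickled data took less than a second, versus 3 or 4 seconds to load and process the original image. But as mentioned, this resulted in file sizes of around 600-800 MB, while the original png images were only around 5 MB. So I am hoping to strike a balance between loading time and file size by storing the pickled data in a compressed format.
Update: the situation is actually a bit more complicated than described above. My application uses PySide2, so I have access to the Qt libraries.
- If I read the images and convert them to a numpy array using pillow (PIL.Image), I don't actually have to do any processing, but the total time to read the image into the array is around 4 seconds.
- If I instead use QImage to read the image, I then have to do some processing on the result to make it usable by the rest of my program, because of the byte order in which QImage loads the data. Basically I have to swap the byte order and then rotate each "pixel" so that the alpha channel (which is apparently added by QImage) comes last rather than first. This whole process takes about 3.8 seconds, so it is marginally faster than just using PIL.
- If I save the numpy arrays uncompressed, I can load them back in 0.8 seconds, which is by far the fastest, but the files are large.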
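The QImage post-processing described above (reverse the bytes of each pixel, then rotate so alpha comes last) can be sketched in numpy as follows. This assumes the buffer arrives as four bytes per pixel in B, G, R, A order, which is an assumption rather than something the question states; `bgra_to_rgba` is a hypothetical helper name.

```python
import numpy as np

def bgra_to_rgba(pixels: np.ndarray) -> np.ndarray:
    """Reverse each 4-byte pixel (BGRA -> ARGB), then rotate alpha to the end (-> RGBA)."""
    argb = pixels[..., ::-1]                 # swap the byte order within each pixel
    rgba = np.roll(argb, -1, axis=-1)        # move the leading alpha byte to the last slot
    return np.ascontiguousarray(rgba)

# One hypothetical pixel with B=10, G=20, R=30, A=255.
pixel = np.array([[[10, 20, 30, 255]]], dtype=np.uint8)
rgba = bgra_to_rgba(pixel)
print(rgba)
```

Both steps together are equivalent to the single fancy index `pixels[..., [2, 1, 0, 3]]`, which may be worth timing against the two-step form on a full-size image.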

┌────────────┬────────────────────────┬───────────────┬─────────────┐
│ Python Ver │     Library/Method     │ Read/unpack + │ Compression │
│            │                        │ Decompress (s)│    Ratio    │
├────────────┼────────────────────────┼───────────────┼─────────────┤
│ 3.7.2      │ pillow (PIL.Image)     │ 4.0           │ ~0.006      │
│ 3.7.2      │ Qt (QImage)            │ 3.8           │ ~0.006      │
│ 3.7.2      │ numpy (uncompressed)   │ 0.8           │ 1.0         │
│ 3.7.2      │ gzip (compresslevel=9) │ ?             │ ?           │
│ 3.7.2      │ gzip (compresslevel=?) │ ?             │ ?           │
│ 3.7.2      │ bz2 (compresslevel=9)  │ ?             │ ?           │
│ 3.7.2      │ bz2 (compresslevel=?)  │ ?             │ ?           │
│ 3.7.2      │ lzma                   │ ?             │ ?           │
├────────────┼────────────────────────┼───────────────┼─────────────┤
│ 3.7.3      │ ?                      │ ?             │ ?           │  
├────────────┼────────────────────────┼───────────────┼─────────────┤
│ 3.8beta1   │ ?                      │ ?             │ ?           │
├────────────┼────────────────────────┼───────────────┼─────────────┤
│ 3.8.0final │ ?                      │ ?             │ ?           │
├────────────┼────────────────────────┼───────────────┼─────────────┤
│ 3.5.7      │ ?                      │ ?             │ ?           │
├────────────┼────────────────────────┼───────────────┼─────────────┤
│ 3.6.10     │ ?                      │ ?             │ ?           │
└────────────┴────────────────────────┴───────────────┴─────────────┘

Example .png image: take this 5.0 MB png image, a fairly high resolution image of the coastline of Alaska, as an example.

Code for the png/PIL case (load into a numpy array):

from PIL import Image
import time
import numpy

start = time.time()
FILE = '/path/to/file/AlaskaCoast.png'
Image.MAX_IMAGE_PIXELS = None
img = Image.open(FILE)
arr = numpy.array(img)
print("Loaded in", time.time()-start)

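This load takes around 4.2 seconds on my machine with Python 3.7.2.

Alternatively, I can instead load the uncompressed pickle file generated by pickling the array created above.

Code for the uncompressed pickle load case: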

import pickle
import time

start = time.time()    
with open('/tmp/test_file.pickle','rb') as picklefile:
  arr = pickle.load(picklefile)    
print("Loaded in", time.time()-start)

Loading from this uncompressed pickle file takes approximately 0.8 seconds on my machine.
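One way to fill in the unknown rows of the table above is a small harness like the following. It is only a sketch: `time_decompress` is a hypothetical helper, the sample buffer is synthetic, and it measures in-memory decompression only, not disk reads or unpickling.

```python
import bz2
import gzip
import lzma
import time

def time_decompress(payload: bytes, repeats: int = 3) -> dict:
    """Return {codec name: (compression ratio, best decompression seconds)}."""
    codecs = {
        "gzip-9": (lambda d: gzip.compress(d, compresslevel=9), gzip.decompress),
        "bz2-9": (lambda d: bz2.compress(d, compresslevel=9), bz2.decompress),
        "lzma": (lzma.compress, lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        blob = compress(payload)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            restored = decompress(blob)
            best = min(best, time.perf_counter() - start)
        assert restored == payload  # sanity check: the round trip is lossless
        results[name] = (len(blob) / len(payload), best)
    return results

# Hypothetical stand-in for the pickled array; substitute the real bytes.
sample = bytes(range(256)) * 8192  # 2 MiB
table = time_decompress(sample)
for name, (ratio, seconds) in table.items():
    print(f"{name:8s} ratio {ratio:.4f}  decompress {seconds:.4f} s")
```

To benchmark the real data, pass `pickle.dumps(arr)` (or `arr.tobytes()`) as the payload instead of the synthetic sample.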

Answer 1

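What I think should be fast is:

compressing with gzip (or another codec)
storing the compressed data directly in a Python module as a literal bytes object
loading the decompressed form directly into a numpy array

i.e. write a program that generates source code like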

import gzip, numpy
data = b'\x00\x01\x02\x03'  # placeholder: the generated gzip-compressed bytes go here
unpacked = numpy.frombuffer(gzip.decompress(data), numpy.uint8)

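The packed data ends up encoded directly in the .pyc file.

For low-entropy data, gzip decompression should be quite fast (edit: not surprisingly, lzma is even faster, and it is still a predefined Python module).

With your "alaska" data, this approach gives the following performance on my machine: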
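A generator for such a module might look like the sketch below. The module name `packed_image`, the helper `write_data_module` and the synthetic payload are all hypothetical; it uses lzma and plain bytes so that it runs with the standard library alone.

```python
import importlib
import lzma
import os
import sys
import tempfile

def write_data_module(path: str, raw: bytes) -> None:
    """Write a Python module whose body is a compressed bytes literal plus its decompression."""
    with open(path, "w") as f:
        f.write("import lzma\n")
        f.write(f"data = {lzma.compress(raw)!r}\n")   # repr() yields a valid bytes literal
        f.write("unpacked = lzma.decompress(data)\n")

raw = bytes(range(256)) * 1024            # stand-in for arr.tobytes()
module_dir = tempfile.mkdtemp()
write_data_module(os.path.join(module_dir, "packed_image.py"), raw)

# Importing the generated module runs the decompression as a side effect.
sys.path.insert(0, module_dir)
importlib.invalidate_caches()
packed_image = importlib.import_module("packed_image")
print(len(packed_image.unpacked))
```

For a real array, the generated module would additionally wrap the result with `numpy.frombuffer(...)` and reshape it, as in the snippet above.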

compression   source module size   bytecode size   import time
-----------   ------------------   -------------   -----------
gzip -9               26,133,461       9,458,176          1.79
lzma                  11,534,009       2,883,695          1.08

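You can even distribute just the .pyc, provided you can control the Python version used; the code to load a .pyc in Python 2 was a one-liner, but it is now more convoluted (apparently it was decided that loading a .pyc shouldn't be convenient).

Note that compiling the module is reasonably fast (e.g. the lzma version compiles on my machine in just 0.1 seconds), but it is a pity to waste 11 MB more on disk for no real reason.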
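For reference, one way to load a bare .pyc in Python 3 is through `importlib.machinery.SourcelessFileLoader`. The sketch below is not necessarily what the answer had in mind; `load_pyc` and the `packed_demo` file names are hypothetical, and the .pyc must have been produced by the same Python version that loads it.

```python
import importlib.machinery
import importlib.util
import os
import py_compile
import tempfile

def load_pyc(name: str, pyc_path: str):
    """Import a module from a bytecode-only file."""
    loader = importlib.machinery.SourcelessFileLoader(name, pyc_path)
    spec = importlib.util.spec_from_loader(name, loader)
    module = importlib.util.module_from_spec(spec)
    loader.exec_module(module)
    return module

# Build a tiny module, compile it, then delete the source so only the .pyc remains.
work_dir = tempfile.mkdtemp()
source_path = os.path.join(work_dir, "packed_demo.py")
pyc_path = os.path.join(work_dir, "packed_demo.pyc")
with open(source_path, "w") as f:
    f.write("data = b'\\x00\\x01\\x02\\x03'\n")
py_compile.compile(source_path, cfile=pyc_path, doraise=True)
os.remove(source_path)

demo = load_pyc("packed_demo", pyc_path)
print(demo.data)
```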

Answer 2

The low-hanging fruit

numpy.savez_compressed('AlaskaCoast.npz', arr)
arr = numpy.load('AlaskaCoast.npz')['arr_0']

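Loading is 2.3x faster than with your PIL-based code.

It uses zipfile.ZIP_DEFLATED; see the savez_compressed documentation.

Your PIL code also has an unneeded copy: array(img) should be asarray(img). It only costs 5% of the slow loading time. But after optimization this will be significant, and you have to keep in mind which numpy operators create a copy.

Fast decompression

According to the zstd benchmarks, lz4 is a good choice when optimizing for decompression. Just plugging it into pickle gives another 2.4x gain and is only 30% slower than uncompressed pickling.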
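The difference between the two calls can be seen on a plain ndarray, as in this small sketch. Note the assumption: it only demonstrates the behaviour for array inputs; with a PIL image, `asarray` still has to build an array from the image data, it just avoids an extra copy.

```python
import numpy as np

source = np.zeros((4, 4), dtype=np.uint8)

copied = np.array(source)     # allocates a new buffer by default
viewed = np.asarray(source)   # returns the input itself when no conversion is needed

print(np.shares_memory(source, copied))
print(np.shares_memory(source, viewed))
```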

import pickle
import lz4.frame

# with lz4.frame.open('AlaskaCoast.lz4', 'wb') as f:
#     pickle.dump(arr, f)

with lz4.frame.open('AlaskaCoast.lz4', 'rb') as f:
    arr = pickle.load(f)

Benchmarks

method                 size   load time
------                 ----   ---------
original (PNG+PIL)     5.1M   7.1
np.load (compressed)   6.7M   3.1
pickle + lz4           7.1M   1.3
pickle (uncompressed)  601M   1.0 (baseline)

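The load time was measured inside Python (3.7.3), using the minimum wall-clock time over 20 runs on my desktop. According to occasional glances at top, it always seemed to be running on a single core.

For the curious: profiling

I'm not sure whether the Python version matters; most of the work is supposed to happen inside C libraries. To validate this, I profiled the pickle + lz4 variant: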
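The "minimum over 20 runs" measurement can be reproduced with `timeit.repeat`, roughly as below. The `load` function here is a hypothetical stand-in that decompresses a synthetic zlib buffer; replace its body with the real loading code.

```python
import timeit
import zlib

blob = zlib.compress(bytes(range(256)) * 4096)

def load():
    """Stand-in for the loading code under test."""
    return zlib.decompress(blob)

# 20 independent runs of one call each; the minimum is the run least disturbed by other activity.
timings = timeit.repeat(load, repeat=20, number=1)
best = min(timings)
print(f"best of {len(timings)} runs: {best:.6f} s")
```

Taking the minimum rather than the mean filters out runs slowed down by unrelated system load, although repeated runs will also benefit from the OS file cache.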

perf record ./test.py && perf report -s dso
Overhead  Shared Object
  60.16%  [kernel.kallsyms]  # mostly page_fault and alloc_pages_vma
  27.53%  libc-2.28.so       # mainly memmove
   9.75%  liblz4.so.1.8.3    # only LZ4_decompress_*
   2.33%  python3.7
   ...

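Most of the time is spent inside the Linux kernel, doing page_fault and work associated with (re-)allocating memory, probably including disk I/O. The high amount of memmove looks suspicious. Python is probably re-allocating (resizing) the final array every time a new decompressed chunk arrives. If anyone would like a closer look: python and perf profiles.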
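If repeated resizing really is the cause, one thing to try is allocating the destination once and decompressing into it in bounded chunks. The sketch below illustrates the idea with the standard-library gzip module rather than lz4, stores raw bytes instead of a pickle, and uses a hypothetical file name; whether it removes the memmove overhead seen in the profile is untested.

```python
import gzip
import os
import tempfile

CHUNK = 1 << 20  # read at most 1 MiB of decompressed data per call

raw = bytes(range(256)) * 4096
path = os.path.join(tempfile.mkdtemp(), "array.bin.gz")
with gzip.open(path, "wb") as f:
    f.write(raw)

# The uncompressed size must be known up front (for an array: from shape and dtype).
buffer = bytearray(len(raw))
view = memoryview(buffer)

filled = 0
with gzip.open(path, "rb") as f:
    while filled < len(view):
        n = f.readinto(view[filled:filled + CHUNK])
        if not n:
            break  # stream ended earlier than expected
        filled += n

print(filled)
```

With numpy, the same loop could target `memoryview(arr).cast("B")` for an array created with `np.empty(shape, dtype)`, so the final array is never grown.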

Answer 3

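You can keep using your existing PNGs and enjoy the space savings, but gain some speed by using libvips. Here is a comparison, but rather than testing the speed of my laptop against yours, I have shown 3 different methods so you can see the relative speeds. I used:

PIL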
OpenCV
pyvips

#!/usr/bin/env python3

import numpy as np
import pyvips
import cv2
from PIL import Image

def usingPIL(f):
    im = Image.open(f)
    return np.asarray(im)

def usingOpenCV(f):
    arr = cv2.imread(f,cv2.IMREAD_UNCHANGED)
    return arr

def usingVIPS(f):
    image = pyvips.Image.new_from_file(f)
    mem_img = image.write_to_memory()
    imgnp=np.frombuffer(mem_img, dtype=np.uint8).reshape(image.height, image.width, 3) 
    return imgnp

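I then checked the performance in IPython because it has nice timing functions. As you can see, pyvips is 13 times faster than PIL, even with PIL being 2 times faster than the original version because it avoids the array copy: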

In [49]: %timeit usingPIL('Alaska1.png')                                                            
3.66 s ± 31.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [50]: %timeit usingOpenCV('Alaska1.png')                                                         
6.82 s ± 23.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [51]: %timeit usingVIPS('Alaska1.png')                                                           
276 ms ± 4.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Quick test results match
np.sum(usingVIPS('Alaska1.png') - usingPIL('Alaska1.png')) 
0

Answer 4

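You can use Python-blosc

It is very fast and, for small arrays (<2 GB), also quite easy to use. On easily compressible data like your example, it is often faster to compress the data for IO operations (SATA SSD: about 500 MB/s, PCIe SSD: up to 3500 MB/s). In the decompression step, the array allocation is the most costly part. If your images have similar shapes, you can avoid the repeated memory allocation.

Example

A contiguous array is assumed for the following example.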

import blosc
import pickle
import numpy as np  # needed for np.empty in decompress()

def compress(arr,Path):
    #c = blosc.compress_ptr(arr.__array_interface__['data'][0], arr.size, arr.dtype.itemsize, clevel=3,cname='lz4',shuffle=blosc.SHUFFLE)
    c = blosc.compress_ptr(arr.__array_interface__['data'][0], arr.size, arr.dtype.itemsize, clevel=3,cname='zstd',shuffle=blosc.SHUFFLE)
    f=open(Path,"wb")
    pickle.dump((arr.shape, arr.dtype),f)
    f.write(c)
    f.close()
    return c,arr.shape, arr.dtype

def decompress(Path):
    f=open(Path,"rb")
    shape,dtype=pickle.load(f)
    c=f.read()
    #array allocation takes most of the time
    arr=np.empty(shape,dtype)
    blosc.decompress_ptr(c, arr.__array_interface__['data'][0])
    return arr

#Pass a preallocated array if you have many similar images
def decompress_pre(Path,arr):
    f=open(Path,"rb")
    shape,dtype=pickle.load(f)
    c=f.read()
    #array allocation takes most of the time
    blosc.decompress_ptr(c, arr.__array_interface__['data'][0])
    return arr

Benchmarks

#blosc.SHUFFLE, cname='zstd' -> 4728KB,  
%timeit compress(arr,"Test.dat")
1.03 s ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#611 MB/s
%timeit decompress("Test.dat")
146 ms ± 481 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#4310 MB/s
%timeit decompress_pre("Test.dat",arr)
50.9 ms ± 438 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#12362 MB/s

#blosc.SHUFFLE, cname='lz4' -> 9118KB, 
%timeit compress(arr,"Test.dat")
32.1 ms ± 437 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#19602 MB/s
%timeit decompress("Test.dat")
146 ms ± 332 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#4310 MB/s
%timeit decompress_pre("Test.dat",arr)
53.6 ms ± 82.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#11740 MB/s
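To see why reading compressed data can beat reading raw data, here is a back-of-the-envelope model using the figures quoted in this answer. It is a rough sketch only: `estimated_load_seconds` is a hypothetical helper, the uncompressed size is inferred from 146 ms at 4310 MB/s, and the measured 146 ms may already include a (cached) file read.

```python
def estimated_load_seconds(stored_mb: float, disk_mb_per_s: float,
                           decompress_s: float = 0.0) -> float:
    """Rough model: time to read the stored bytes plus time to decompress them."""
    return stored_mb / disk_mb_per_s + decompress_s

SATA_SSD_MB_PER_S = 500.0          # SATA SSD figure quoted in the answer
uncompressed_mb = 0.146 * 4310     # about 629 MB, implied by 146 ms at 4310 MB/s
zstd_mb = 4728 / 1024              # the 4728 KB blosc/zstd file

raw_read = estimated_load_seconds(uncompressed_mb, SATA_SSD_MB_PER_S)
zstd_read = estimated_load_seconds(zstd_mb, SATA_SSD_MB_PER_S, decompress_s=0.146)

print(f"uncompressed from SATA SSD: ~{raw_read:.2f} s")
print(f"blosc/zstd from SATA SSD:   ~{zstd_read:.2f} s")
```

Under this model the disk read dominates for the uncompressed file, while for the compressed file nearly all of the time is decompression and allocation.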

Edit

This version is better suited for general use. It handles F-contiguous, C-contiguous and non-contiguous arrays, as well as arrays >2 GB. Also have a look at bloscpack.

import blosc
import pickle
import numpy as np  # used throughout compress() and decompress()

def compress(file, arr,clevel=3,cname='lz4',shuffle=1):
    """
    file           path to file
    arr            numpy nd-array
    clevel         0..9
    cname          blosclz,lz4,lz4hc,snappy,zlib
    shuffle        0-> no shuffle, 1->shuffle,2->bitshuffle
    """
    max_blk_size=100_000_000 #100 MB 

    shape=arr.shape
    #object dtype is not implemented (the np.object alias was removed in newer NumPy)
    if arr.dtype==object:
        raise TypeError("dtype object is not implemented")

    #Handling of fortran ordered arrays (avoid copy)
    is_f_contiguous=False
    if arr.flags['F_CONTIGUOUS']==True:
        is_f_contiguous=True
        arr=arr.T.reshape(-1)
    else:
        arr=np.ascontiguousarray(arr.reshape(-1))

    #Writing
    max_num=max_blk_size//arr.dtype.itemsize
    num_chunks=arr.size//max_num

    if arr.size%max_num!=0:
        num_chunks+=1

    f=open(file,"wb")
    pickle.dump((shape,arr.size,arr.dtype,is_f_contiguous,num_chunks,max_num),f)
    size=np.empty(1,np.uint32)
    num_write=max_num
    for i in range(num_chunks):
        if max_num*(i+1)>arr.size:
            num_write=arr.size-max_num*i
        c = blosc.compress_ptr(arr[max_num*i:].__array_interface__['data'][0], num_write, 
                               arr.dtype.itemsize, clevel=clevel,cname=cname,shuffle=shuffle)
        size[0]=len(c)
        size.tofile(f)
        f.write(c)
    f.close()

def decompress(file,prealloc_arr=None):
    f=open(file,"rb")
    shape,arr_size,dtype,is_f_contiguous,num_chunks,max_num=pickle.load(f)

    if prealloc_arr is None:
        #no buffer supplied: allocate a fresh flat array
        arr=np.empty(arr_size,dtype)
    else:
        #reuse the caller's buffer; it must be contiguous
        if prealloc_arr.flags['F_CONTIGUOUS']==True:
            prealloc_arr=prealloc_arr.T
        if prealloc_arr.flags['C_CONTIGUOUS']!=True:
            raise TypeError("Contiguous array is needed")
        arr=np.frombuffer(prealloc_arr.data, dtype=dtype, count=arr_size)

    for i in range(num_chunks):
        size=np.fromfile(f,np.uint32,count=1)
        c=f.read(size[0])
        blosc.decompress_ptr(c, arr[max_num*i:].__array_interface__['data'][0])
    f.close()

    #reshape
    if is_f_contiguous:
        arr=arr.reshape(shape[::-1]).T
    else:
        arr=arr.reshape(shape)
    return arr
