C++/C: Multiple threads reading a gz file simultaneously

Question

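I am trying to read a gzip-compressed file from multiple threads.

I expected this to speed up decompression significantly, because the gzread calls in my threads start from different file offsets (set with gzseek), so they read different parts of the file.

The simplified code looks like this: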

// in threads
auto gf = gzopen("file.gz",xxx);
gzseek(gf,offset);
gzread(xx);
gzclose(gf);

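To my surprise, the multithreaded version of my program shows no speedup at all. The 20-thread version takes exactly the same time as the single-threaded version. I am quite sure this is nowhere near being disk-bound.

My guess is that zlib's inflate may need to decompress the whole file in order to read a small part of it, but I could not find any hint about this in the manual.

Does anyone know how I can speed this up?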

Answer 1

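The zlib implementation is not multithreaded (http://www.zlib.net/zlib_faq.html#faq21 - "Is zlib thread-safe? - Yes. ... Of course, you should only operate on any given zlib or gzip stream from a single thread at a time."), and it will decompress the "entire file" up to the position being sought.

In addition, the deflate format is bit-aligned and has no offset fields, so it provides nothing that would enable parallel decompression or seeking.

You could try another implementation of deflate/inflate, such as http://zlib.net/pigz/ (or switch from this ancient, single-core-era compression to a modern, non-zlib, parallel format: xz/lzma, or something from Google).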

pigz, which stands for parallel implementation of gzip, is a fully functional replacement for gzip that exploits multiple processors and multiple cores to the hilt when compressing data. pigz was written by Mark Adler, and uses the zlib and pthread libraries. To compile and use pigz, please read the README file in the source code distribution. You can read the pigz manual page here.

The manual page is at http://zlib.net/pigz/pigz.pdf and contains useful information.

It uses a zlib-compatible format, but compresses in parallel:

Each partial raw deflate stream is terminated by an empty stored block ... in order to end that partial bit stream at a byte boundary.
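To make the "byte boundary" remark concrete: in the deflate format (RFC 1951), a stored block pads to the next byte boundary and then carries a 16-bit length followed by its one's complement, so an empty stored block ends with the four bytes 00 00 FF FF. The sketch below scans a raw buffer for that byte sequence. The function name candidate_boundaries is hypothetical and this is not pigz or zlib code. The hits are only candidates: the same four bytes can occur by chance inside compressed data, and even a genuine boundary does not by itself allow independent inflation, because later blocks may refer back to up to 32 KiB of earlier output.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the offsets immediately after each 00 00 FF FF sequence in `data`.
// These are candidate byte-aligned block boundaries, not confirmed ones.
std::vector<std::size_t> candidate_boundaries(const std::vector<std::uint8_t>& data) {
    static const std::uint8_t kMarker[4] = {0x00, 0x00, 0xFF, 0xFF};
    std::vector<std::size_t> offsets;
    for (std::size_t i = 0; i + 4 <= data.size(); ++i) {
        // Compare the four bytes starting at position i with the marker.
        if (std::equal(kMarker, kMarker + 4, data.begin() + i)) {
            offsets.push_back(i + 4);  // first byte after the marker
        }
    }
    return offsets;
}
```

Finding such offsets is the easy part; the reason decompression stays serial is the dependency on earlier output, not the search for boundaries.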

The DEFLATE format is still ill-suited to parallel decompression:

Decompression can’t be parallelized, at least not without specially prepared deflate streams for that purpose. As a result, pigz uses a single thread (the main thread) for decompression, but will create three other threads for reading, writing, and check calculation, which can speed up decompression under some circumstances.
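Since inflating a single stream stays serial, extra threads can only help with the work around it. The following is a minimal producer/consumer sketch of that idea, not pigz's actual code: one thread pulls decompressed chunks in order and worker threads process them concurrently. All names here (ChunkQueue, run_pipeline, next_chunk, process) are hypothetical. next_chunk stands in for whatever single-threaded loop produces decompressed data, for example one built around gzread.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Thread-safe queue of decompressed chunks (unbounded, to keep the sketch short).
class ChunkQueue {
public:
    void push(std::string chunk) {
        { std::lock_guard<std::mutex> lock(m_); q_.push(std::move(chunk)); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lock(m_); closed_ = true; }
        cv_.notify_all();
    }
    // Blocks until a chunk is available; returns nullopt once closed and drained.
    std::optional<std::string> pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return closed_ || !q_.empty(); });
        if (q_.empty()) return std::nullopt;
        std::string chunk = std::move(q_.front());
        q_.pop();
        return chunk;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
    bool closed_ = false;
};

// next_chunk: assumed stand-in for the serial decompressor; nullopt means end of stream.
// process: per-chunk work; it must be safe to call from several threads at once.
void run_pipeline(const std::function<std::optional<std::string>()>& next_chunk,
                  const std::function<void(const std::string&)>& process,
                  unsigned worker_count) {
    assert(worker_count > 0);
    ChunkQueue queue;
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < worker_count; ++i) {
        workers.emplace_back([&queue, &process] {
            while (auto chunk = queue.pop()) process(*chunk);
        });
    }
    // The serial part: only this thread ever touches the compressed stream.
    while (auto chunk = next_chunk()) queue.push(std::move(*chunk));
    queue.close();
    for (auto& w : workers) w.join();
}
```

This only pays off when the per-chunk processing is expensive compared with inflation. If decompression itself is the bottleneck, the pipeline runs no faster than the single producer thread.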

Answer 2

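tl;dr: zlib is not designed for random access. It seems possible to implement, but it would require a complete read-through to build an index, so it may not help in your case.

Let's look at the zlib source. gzseek is a wrapper around gzseek64, which contains: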

/* if within raw area while reading, just go there */
if (state->mode == GZ_READ && state->how == COPY &&
        state->x.pos + offset >= 0) {

If we are dealing with a gzip file, "within raw area" doesn't sound quite right. Let's check what state->how means in gzguts.h:

int how; /* 0: get header, 1: copy, 2: decompress */

Right. At the end of gz_open, a call to gz_reset sets how to 0. Back in gzseek64, we end up with the following modifications to the state:

state->seek = 1;
state->skip = offset;

gzread, when called, processes this with a call to gz_skip:

if (state->seek) {
    state->seek = 0;
    if (gz_skip(state, state->skip) == -1)
        return -1;
}

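Going further down this rabbit hole, we find that gz_skip calls gz_fetch until gz_fetch has processed enough input for the requested seek. On its first loop iteration, gz_fetch calls gz_look, which sets state->how = GZIP, and that causes gz_fetch to decompress data from the input. In other words, your suspicion is right: zlib does decompress the entire file up to that point when you use gzseek.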

Answer 3

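Short answer: because of the serial nature of the compressed stream, gzseek() has to decode all of the compressed data from the start up to the requested seek point. So you gain nothing from what you are trying to do. In fact, the total number of cycles spent will grow with the square of the length of the compressed data! So don't do that.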
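The quadratic growth can be checked with a small cost model. This is a hypothetical sketch, not zlib code, and it makes two simplifying assumptions: cost is counted in uncompressed bytes that must be inflated, and each worker seeks to its own chunk start and reads one fixed-size chunk. SeekCost and cost_of_seeking_workers are illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Assumed cost model: one unit of work per uncompressed byte inflated.
struct SeekCost {
    std::uint64_t total_inflated;  // summed over all workers (total cycles spent)
    std::uint64_t longest_worker;  // the slowest worker (bounds wall-clock time)
};

// Each worker seeks to `start` (which costs inflating `start` bytes that are
// then thrown away) and reads one chunk up to `end`.
SeekCost cost_of_seeking_workers(std::uint64_t file_size, std::uint64_t chunk_size) {
    assert(chunk_size > 0);
    SeekCost cost{0, 0};
    for (std::uint64_t start = 0; start < file_size; start += chunk_size) {
        const std::uint64_t end = std::min(file_size, start + chunk_size);
        cost.total_inflated += end;  // skipped bytes plus the bytes actually wanted
        cost.longest_worker = std::max(cost.longest_worker, end);
    }
    return cost;
}
```

Under these assumptions, a 100-byte stream split into 10-byte chunks costs 550 bytes of inflation in total, and doubling the stream to 200 bytes raises that to 2100, roughly four times as much. The slowest worker always inflates the whole stream, which is consistent with the 20-thread version taking as long as the single-threaded one.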
