Deflate压缩块的结构

Question

我很难理解 Deflate 算法 (RFC 1951)。

TL; DR如何解析Deflate压缩块4be4 0200?

我创建了一个文件，其中包含一个字母和换行符 a\n，以及运行 gzip a.txt。结果文件 a.txt.gz:

1f8b 0808 fe8b eb55 0003 612e 7478 7400

4be4 0200

07a1 eadd 0200 0000

我知道第一行是 header 和附加信息，最后一行是 CRC32 加上输入的大小 (RFC 1951)。这两个不给我带来麻烦。

但是我如何解释压缩块本身（中间行）？

这是它的十六进制和二进制表示形式：

据我了解，不知何故这些：

Each block of compressed data begins with 3 header bits containing the following data:

first bit BFINAL

next 2 bits BTYPE

...实际上在第一个字节的 end 结束：0100 1011。（我将跳过这个问题，为什么有人会称“header”实际上位于其他东西尾部的东西。）

RFC 包含的内容据我所知应该是对此的解释：

Data elements are packed into bytes in order of increasing bit number within the byte, i.e., starting with the least-significant bit of the byte.

Data elements other than Huffman codes are packed starting with the least-significant bit of the data element.

Huffman codes are packed starting with the most- significant bit of the code.

In other words, if one were to print out the compressed data as a sequence of bytes, starting with the first byte at the right margin and proceeding to the left, with the most- significant bit of each byte on the left as usual, one would be able to parse the result from right to left, with fixed-width elements in the correct MSB-to-LSB order and Huffman codes in bit-reversed order (i.e., with the first bit of the code in the relative LSB position).

但遗憾的是我不明白这个解释。

正在返回我的数据。好的，那么 BFINAL 已设置，BTYPE 是什么？ 10 还是 01？

如何解释该压缩块中的其余数据？

Answer 1

首先让我们看一下压缩数据的十六进制表示形式为一系列字节（而不是您问题中的一系列 16 位大端值）：

4b e4 02 00

现在让我们将这些十六进制字节转换为二进制：

01001011 11100100 00000010 000000000

根据 RFC，这些位是“从字节的最低有效位开始”打包的。一个字节的最低有效位是该字节的最右边的位。所以第一个字节的第一位是这个：

01001011 11100100 00000010 000000000
       ^
       first bit

第二位是这个：

01001011 11100100 00000010 000000000
      ^
      second bit

第三位：

01001011 11100100 00000010 000000000
     ^
     third bit

等等。一旦你遍历了第一个字节中的所有位，你就从第二个字节的最低有效位开始。所以第九位是这个：

01001011 11100100 00000010 000000000
                ^
                ninth bit

最后一点，即第 30 秒，是这个：

01001011 11100100 00000010 000000000
                           ^
                           last bit

BFINAL 值是压缩数据中的第一位，因此包含在上面标记为“第一位”的单个位中。它的值为 1，表示这是压缩数据中的最后一个块。

BTYPE 值存储在数据的下两位中。这些是上面标记为“第二位”和“第三位”的位。唯一的问题是两者中哪一位是最低有效位，哪位是最高有效位。根据 RFC，“除霍夫曼代码外的数据元素被打包从数据元素的最低有效位开始。”这意味着这两个位中的第一个，标记为“第二位”的那个是最低有效位。这意味着 BTYPE 的值是 01 binary. and so 表示该块是使用固定霍夫曼编码压缩的。

这是最简单的部分。解码压缩块的其余部分更加困难（并且在更现实的示例中，困难得多）。正确解释该怎么做会使该站点的答案太长（并且您的问题太宽泛）。不过，我会给你一个提示，数据中接下来的三个元素是霍夫曼代码 10010001 ('a')、00111010 ('\n') 和 0000000（流结束）。其余 6 位未使用，不是压缩数据的一部分。

请注意，要了解如何解码 deflate 压缩数据，您必须了解 Huffman codes are. The RFC you're following assumes that you do. You should also know how LZ77 compression 的工作原理，尽管该文档或多或少地解释了您需要了解的内容。

Deflate压缩块的结构

The structure of Deflate compressed block

unix

assembly

gzip

bit-manipulation

deflate