Understanding the zlib header; CMF (CM, CINFO), FLG, (FDICT/DICTID, FLEVEL); RFC1950 § 2.2. Data format

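I am curious about the zlib data format and am trying to understand the zlib header as described in RFC1950 (https://www.rfc-editor.org/rfc/rfc1950). However, I am new to this kind of low-level interpretation, and some of my conclusions do not seem to agree with one another.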

I have the following compressed data (from a PDF stream object):

b'h\xdebbd\x10`b`Rcb`\xb0ab`\xdc\x0b\xa4\x93\x98\x18\xfe>\x06\xb2\xed\x01\x02\x0c\x00!\xa4\x03\xc4'

In Python, I have successfully decompressed and re-compressed the data:

b'x\xdacbd\x10`b`Rcb`\xb0ab`\xdc\x0b\xa4\x93\x98\x18\xfe>\x06\xb2\xed\x01!\xa4\x03\xc4'

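As I understand from a related discussion/answer, the difference between the two compressed results should not matter, since it only reflects different methods being applied to compress the same data.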

Assuming the last four bytes !\xa4\x03\xc4 are the ADLER32 (Adler-32 checksum), my questions concern the first 2 bytes.

  0   1     0   1   2   3                             0   1   2   3
+---+---+ +---+---+---+---+ +=====================+ +---+---+---+---+
|CMF|FLG| |    [DICTID]   | |...compressed data...| |    ADLER32    |
+---+---+ +---+---+---+---+ +=====================+ +---+---+---+---+
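The layout in the diagram can be checked mechanically. The following is a minimal sketch using Python's standard zlib module; split_zlib_stream is a hypothetical helper name, and the payload is made up for illustration rather than taken from the PDF stream. It relies on RFC 1950 storing ADLER32 (and DICTID) most-significant byte first.

```python
import zlib

def split_zlib_stream(stream: bytes):
    """Split a zlib stream into (cmf, flg, dictid, deflate_body, adler32)."""
    cmf, flg = stream[0], stream[1]
    pos = 2
    dictid = None
    if flg & 0x20:  # FDICT (bit 5) set: a 4-byte DICTID follows FLG
        dictid = int.from_bytes(stream[2:6], "big")
        pos = 6
    body = stream[pos:-4]                       # raw deflate data
    adler = int.from_bytes(stream[-4:], "big")  # trailer, big-endian
    return cmf, flg, dictid, body, adler

# Illustrative payload (an assumption, not the data from the question).
payload = b"example payload, not the PDF stream from the question"
stream = zlib.compress(payload)
cmf, flg, dictid, body, adler = split_zlib_stream(stream)

# wbits=-15 asks zlib to inflate a raw deflate stream (no header/trailer).
restored = zlib.decompress(body, -15)
print(hex(cmf), hex(flg), dictid, adler == zlib.adler32(restored))
```

The same helper could be pointed at either byte string above to confirm that the trailing four bytes really are the Adler-32 of the decompressed data.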

CMF

The first byte represents the CMF, which in my two instances is h (0x68) and x (0x78).

This byte is divided into a 4-bit compression method and a 4-bit information field depending on the compression method.

  • bits 0 to 3 CM Compression method

  • bits 4 to 7 CINFO Compression info

+----|----+      +----|----+     +----|----+
|0000|0000| i.e. |0110|1000| and |0111|1000|
+----|----+      +----|----+     +----|----+
  CM |CINFO        CM |CINFO       CM |CINFO

where

[CM] identifies the compression method used in the file. CM = 8 denotes the "deflate" compression method with a window size up to 32K. This is the method used by gzip and PNG [...]. CM = 15 is reserved.

For CM = 8, CINFO is the base-2 logarithm of the LZ77 window size, minus eight (CINFO=7 indicates a 32K window size). Values of CINFO above 7 are not allowed in this version of the specification. CINFO is not defined in this specification for CM not equal to 8.

From what I understand,

cf.

You should NOT assume that it's always 8. Instead, you should check it and, if it's not 8, throw a "not supported" error.

cf. https://groups.google.com/forum/#!msg/comp.compression/_y2Wwn_Vq_E/EymIVcQ52cEJ

An exhaustive list of all 64 current possibilities for zlib headers:

COMMON
78 01
78 5e
78 9c
78 da
RARE
08 1d   18 19   28 15   38 11   48 0d   58 09   68 05
08 5b   18 57   28 53   38 4f   48 4b   58 47   68 43
08 99   18 95   28 91   38 8d   48 89   58 85   68 81
08 d7   18 d3   28 cf   38 cb   48 c7   58 c3   68 de
VERY RARE
08 3c   18 38   28 34   38 30   48 2c   58 28   68 24   78 3f
08 7a   18 76   28 72   38 6e   48 6a   58 66   68 62   78 7d
08 b8   18 b4   28 b0   38 ac   48 a8   58 a4   68 bf   78 bb
08 f6   18 f2   28 ee   38 ea   48 e6   58 e2   68 fd   78 f9
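The 64 entries can be regenerated from the header rules. The sketch below assumes two things not stated in the list itself: that FCHECK is chosen so that CMF*256 + FLG is a multiple of 31 (the rule given in RFC 1950), and that when the remainder is already zero the value 31 is used rather than 0, which is what an entry such as 68 bf implies. zlib_headers is a hypothetical name.

```python
def zlib_headers():
    """Enumerate CMF/FLG pairs for CM = 8 that satisfy the FCHECK rule."""
    headers = []
    for cinfo in range(8):          # window size is 2**(cinfo + 8)
        cmf = (cinfo << 4) | 8      # CM = 8 sits in the low nibble
        for fdict in (0, 1):
            for flevel in range(4):
                flg = (flevel << 6) | (fdict << 5)
                # Pick FCHECK so that CMF*256 + FLG is a multiple of 31.
                # Assumption: a zero remainder yields FCHECK = 31, not 0.
                flg += 31 - ((cmf * 256 + flg) % 31)
                headers.append(f"{cmf:02x} {flg:02x}")
    return headers

headers = zlib_headers()
print(len(headers))
print(headers)
```

Eight window sizes, four levels and two FDICT states give the 64 combinations; the entries without a preset dictionary correspond to the COMMON and RARE groups.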

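Q1 My first question is simply

As far as I know, byte order is not an issue here. I suspect it may have something to do with the least-significant bit ordering (RFC1950 § 2.1, Overall conventions), but I do not quite see how that would lead to, for example, 78 rather than 87...

Q2 My second question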

FLG

The second byte represents the FLG

\xde -> 11011110
\xda -> 11011010

[FLG] [...] is divided as follows:

  • bits 0 to 4 FCHECK (check bits for CMF and FLG)

  • bit 5 FDICT (preset dictionary)

  • bits 6 to 7 FLEVEL (compression level)

+-----|-|--+      +-----|-|--+     +-----|-|--+
|00000|0|00| i.e. |11011|1|10| and |11011|0|10|
+-----|-|--+      +-----|-|--+     +-----|-|--+
   C  |D| L          C  |D| L         C  |D| L

As far as I can tell, bits 0-4 are some form of "checksum" or integrity check?

Bit 5 indicates whether a dictionary is present.

FDICT (Preset dictionary) If FDICT is set, a DICT dictionary identifier is present immediately after the FLG byte. The dictionary is a sequence of bytes which are initially fed to the compressor without producing any compressed output. DICT is the Adler-32 checksum of this sequence of bytes (see the definition of ADLER32 below). The decompressor can use this identifier to determine which dictionary has been used by the compressor.
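To see what a stream with FDICT set looks like, one option is to compress with a preset dictionary. This is a minimal sketch assuming Python's zlib.compressobj and its zdict parameter; the dictionary and payload bytes are made up for illustration.

```python
import zlib

# Illustrative dictionary and payload (assumptions, not from the PDF).
zdict = b"hello world"
compressor = zlib.compressobj(zdict=zdict)
stream = compressor.compress(b"hello hello world") + compressor.flush()

flg = stream[1]
fdict = (flg >> 5) & 1                       # bit 5 of FLG
dictid = int.from_bytes(stream[2:6], "big")  # 4 bytes right after FLG

print(format(flg, "08b"), fdict, hex(dictid), hex(zlib.adler32(zdict)))

# The decompressor must be given the same dictionary.
decompressor = zlib.decompressobj(zdict=zdict)
restored = decompressor.decompress(stream)
```

In this case the four bytes after FLG are the Adler-32 of the dictionary, not compressed data.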

Q3 My third question

Assuming "1" means "is set"

\xde -> 11011_1_10
\xda -> 11011_0_10

According to the specification, DICTID consists of 4 bytes. The next four bytes in the compressed streams I have are

bbd\x10
cbd\x10

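Why is the compressed data from the PDF stream object (with FDICT 1) almost identical to the data compressed with Python zlib (with FDICT 0)?

Although I do not understand the purpose of DICTID, shouldn't it be present only if FDICT is set?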

Q4 My fourth question

Bits 6-7 set FLEVEL (compression level)

These flags are available for use by specific compression methods. The "deflate" method (CM = 8) sets these flags as follows:

0 - compressor used fastest algorithm

1 - compressor used fast algorithm

2 - compressor used default algorithm

3 - compressor used maximum compression, slowest algorithm

The information in FLEVEL is not needed for decompression; it is there to indicate if recompression might be worthwhile.

I would have expected the flags to be:

0 (00)
1 (01)
2 (10)
3 (11)

However, from "What does a zlib header look like?":

01 (00000001) - No Compression/low
[5e (01011110) - Default Compression?]
9c (10011100) - Default Compression
da (11011010) - Best Compression
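The mapping from compression level to the second header byte can be observed empirically. This is a rough sketch assuming Python's zlib.compress with its level argument; flevel_for is a hypothetical helper, and the exact bytes depend on the zlib build that Python is linked against.

```python
import zlib

def flevel_for(level: int) -> int:
    """Return the FLEVEL bits (bits 6-7 of FLG) written for a given level."""
    header = zlib.compress(b"sample", level)[:2]
    return header[1] >> 6

for level in (1, 6, 9):
    header = zlib.compress(b"sample", level)[:2]
    # Print the level, the two header bytes, FLG in binary, and FLEVEL.
    print(level, header.hex(), format(header[1], "08b"), flevel_for(level))
```

Only the two most significant bits of FLG carry the level; the remaining bits are FDICT and FCHECK, which is why the full byte values do not look like 0 to 3.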

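I did notice, though, that the two left-most bits seem to match what I expected; I feel I am clearly missing something basic about how to interpret the bits...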

The RFC says:

CMF (Compression Method and flags)
         This byte is divided into a 4-bit compression method and a 4-
         bit information field depending on the compression method.

            bits 0 to 3  CM     Compression method
            bits 4 to 7  CINFO  Compression info

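The least-significant bit of a byte is bit 0. The most-significant bit is bit 7. So your mapping of CM and CINFO to bits is reversed. 0x78 and 0x68 both have a CM of 8. Their CINFO values are 7 and 6 respectively.

CINFO is what the RFC says it is: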

CINFO (Compression info)
   For CM = 8, CINFO is the base-2 logarithm of the LZ77 window
   size, minus eight (CINFO=7 indicates a 32K window size).

So a CINFO of 7 means a 32 KiB window, and 6 means 16 KiB. CINFO == 0 does not mean no compression. It means a window size of 256 bytes.
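As a small illustration of the bit numbering, the CMF byte can be split with a mask and a shift. This is only a sketch; parse_cmf is a hypothetical helper name.

```python
def parse_cmf(cmf: int):
    """Split CMF into (CM, CINFO, window size in bytes)."""
    cm = cmf & 0x0F      # bits 0-3: the low nibble
    cinfo = cmf >> 4     # bits 4-7: the high nibble
    # For CM = 8, CINFO is log2(window size) minus 8.
    window = 1 << (cinfo + 8) if cm == 8 else None
    return cm, cinfo, window

for byte in (0x68, 0x78):
    print(hex(byte), parse_cmf(byte))
```

For 0x68 this gives CM = 8, CINFO = 6 and a 16384-byte window; for 0x78 it gives CM = 8, CINFO = 7 and a 32768-byte window.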

For the flag byte, you again have it reversed. FDICT is not set. For both of your examples, the compression level is 11, maximum compression.
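The flag byte can be decoded the same way. The sketch below uses a hypothetical parse_flg helper; the check_ok field additionally applies the RFC 1950 rule that CMF*256 + FLG must be a multiple of 31, which is what FCHECK is for.

```python
def parse_flg(cmf: int, flg: int):
    """Decode FLG into its fields and verify the FCHECK rule."""
    return {
        "fcheck": flg & 0x1F,        # bits 0-4
        "fdict": (flg >> 5) & 1,     # bit 5
        "flevel": flg >> 6,          # bits 6-7
        "check_ok": (cmf * 256 + flg) % 31 == 0,
    }

for cmf, flg in ((0x68, 0xDE), (0x78, 0xDA)):
    print(hex(cmf), hex(flg), parse_flg(cmf, flg))
```

Both headers decode to FDICT = 0 and FLEVEL = 3, so the bytes following FLG are compressed data rather than a DICTID.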