Understanding the zlib header: CMF (CM, CINFO), FLG (FCHECK, FDICT, FLEVEL), DICTID; RFC 1950 § 2.2, Data format
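I'm curious about the zlib data format and am trying to understand the zlib header as described in RFC 1950 (https://www.rfc-editor.org/rfc/rfc1950). However, I'm new to this kind of low-level interpretation, and some of my conclusions seem to have run into inconsistencies.
I have the following compressed data (from a PDF stream object):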
b'h\xdebbd\x10`b`Rcb`\xb0ab`\xdc\x0b\xa4\x93\x98\x18\xfe>\x06\xb2\xed\x01\x02\x0c\x00!\xa4\x03\xc4'
In Python, I have successfully decompressed and re-compressed the data:
b'x\xdacbd\x10`b`Rcb`\xb0ab`\xdc\x0b\xa4\x93\x98\x18\xfe>\x06\xb2\xed\x01!\xa4\x03\xc4'
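As I understand from a related discussion/answer, the difference between the two compressed results should not matter, since it comes from different methods being applied to compress the same data.
Assuming the last four bytes, !\xa4\x03\xc4, are the ADLER32 (Adler-32 checksum), my questions concern the first 2 bytes.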
  0   1     0   1   2   3                             0   1   2   3
+---+---+ +---+---+---+---+ +=====================+ +---+---+---+---+
|CMF|FLG| |   [DICTID]    | |...compressed data...| |    ADLER32    |
+---+---+ +---+---+---+---+ +=====================+ +---+---+---+---+
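The layout above can be sketched in Python as a small splitter. This is only an illustration: `ZlibParts` and `split_zlib_stream` are hypothetical names, and the sketch assumes the input is one complete zlib stream whose last four bytes are the big-endian ADLER32 trailer. It does not decompress or verify anything.

```python
import struct
from typing import NamedTuple, Optional


class ZlibParts(NamedTuple):
    cmf: int                # first header byte
    flg: int                # second header byte
    dictid: Optional[int]   # only present when FDICT (bit 5 of FLG) is set
    deflate_data: bytes     # raw deflate payload
    adler32: int            # trailer, stored most-significant byte first


def split_zlib_stream(stream: bytes) -> ZlibParts:
    """Split a complete zlib stream into the fields of RFC 1950 section 2.2."""
    if len(stream) < 6:
        raise ValueError("too short to hold a zlib header and trailer")
    cmf, flg = stream[0], stream[1]
    offset, dictid = 2, None
    if flg & 0x20:  # FDICT set: a 4-byte DICTID follows the FLG byte
        if len(stream) < 10:
            raise ValueError("FDICT is set but there is no room for DICTID")
        (dictid,) = struct.unpack(">I", stream[2:6])
        offset = 6
    (adler,) = struct.unpack(">I", stream[-4:])
    return ZlibParts(cmf, flg, dictid, stream[offset:-4], adler)


# The re-compressed stream quoted above.
recompressed = (b'x\xdacbd\x10`b`Rcb`\xb0ab`\xdc\x0b\xa4\x93\x98\x18'
                b'\xfe>\x06\xb2\xed\x01!\xa4\x03\xc4')
parts = split_zlib_stream(recompressed)
print(hex(parts.cmf), hex(parts.flg), parts.dictid, hex(parts.adler32))
```

For this stream the splitter reports no DICTID, so the bytes right after the header belong to the deflate data.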
CMF
The first byte represents CMF, which in my two instances is
chr h = dec 104 = hex 68 = 01101000
and
chr x = dec 120 = hex 78 = 01111000
This byte is divided into a 4-bit compression method and a 4-bit information field depending on the compression method.
bits 0 to 3 CM Compression method
bits 4 to 7 CINFO Compression info
+----|----+ +----|----+ +----|----+
|0000|0000| i.e. |0110|1000| and |0111|1000|
+----|----+ +----|----+ +----|----+
CM |CINFO CM |CINFO CM |CINFO
where
[CM] identifies the compression method used in the file.
CM = 8 denotes the "deflate" compression method with a window size up to 32K. This is the method used by gzip and PNG (see
CM = 15 is reserved.
and
For CM = 8, CINFO is the base-2 logarithm of the LZ77 window size, minus eight (CINFO=7 indicates a 32K window size). Values of CINFO above 7 are not allowed in this version of the specification. CINFO is not defined in this specification for CM not equal to 8.
As I understand it,
- the only valid CM is 8
- CINFO can be 0-7
Cf.
You should NOT assume that it's always 8. Instead, you should check it and, if it's not 8, throw a "not supported" error.
Cf. https://groups.google.com/forum/#!msg/comp.compression/_y2Wwn_Vq_E/EymIVcQ52cEJ
An exhaustive list of all 64 current possibilities for zlib headers:
COMMON
78 01
78 5e
78 9c
78 da
RARE
08 1d 18 19 28 15 38 11 48 0d 58 09 68 05
08 5b 18 57 28 53 38 4f 48 4b 58 47 68 43
08 99 18 95 28 91 38 8d 48 89 58 85 68 81
08 d7 18 d3 28 cf 38 cb 48 c7 58 c3 68 de
VERY RARE
08 3c 18 38 28 34 38 30 48 2c 58 28 68 24 78 3f
08 7a 18 76 28 72 38 6e 48 6a 58 66 68 62 78 7d
08 b8 18 b4 28 b0 38 ac 48 a8 58 a4 68 bf 78 bb
08 f6 18 f2 28 ee 38 ea 48 e6 58 e2 68 fd 78 f9
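The list above can be reproduced with a few lines of Python. This is a sketch under two assumptions: CM is fixed at 8, and FCHECK is chosen as `31 - (value % 31)`, which is how I understand zlib's own deflate code fills it in (the RFC itself only requires that CMF*256 + FLG be a multiple of 31). `zlib_headers` is a hypothetical helper name.

```python
def zlib_headers():
    """Enumerate (CMF, FLG) pairs for CM = 8, every CINFO, FDICT and FLEVEL."""
    headers = []
    for cinfo in range(8):              # CINFO 0..7, stored in bits 4-7 of CMF
        cmf = (cinfo << 4) | 8          # CM = 8 sits in bits 0-3
        for fdict in (0, 1):            # bit 5 of FLG
            for flevel in range(4):     # bits 6-7 of FLG
                flg = (flevel << 6) | (fdict << 5)
                # Fill FCHECK (bits 0-4) so that CMF*256 + FLG is a multiple of 31.
                flg += 31 - ((cmf * 256 + flg) % 31)
                headers.append((cmf, flg))
    return headers


for cmf, flg in zlib_headers():
    print(f"{cmf:02x} {flg:02x}")
```

The entries with FDICT = 1 are the ones the list files under VERY RARE.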
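Q1 My first question is simple
- Why does CINFO come before CM? I.e.,
- why isn't it 87, 80, 81, 82, 83, ...
As far as I know, byte order is not the issue here. I suspect it may have to do with the least-significant bit (RFC 1950 § 2.1, Overall conventions), but I don't quite see how that would lead to, e.g., 78 rather than 87...
Q2 My second question
- If CINFO 7 indicates "a window size up to 32K", what do 1-6 correspond to? (Assuming 0 means a window size of 0, as in no compression applied.)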
FLG
The second byte represents FLG
\xde -> 11011110
\xda -> 11011010
[FLG] [...] is divided as follows:
bits 0 to 4 FCHECK (check bits for CMF and FLG)
bit 5 FDICT (preset dictionary)
bits 6 to 7 FLEVEL (compression level)
+-----|-|--+ +-----|-|--+ +-----|-|--+
|00000|0|00| i.e. |11011|1|10| and |11011|0|10|
+-----|-|--+ +-----|-|--+ +-----|-|--+
C |D| L C |D| L C |D| L
As far as I can tell, bits 0-4 are some form of "checksum" or integrity check?
Bit 5 indicates whether a dictionary is present.
FDICT (Preset dictionary)
If FDICT is set, a DICT dictionary identifier is present immediately after the FLG byte. The dictionary is a sequence of bytes which are initially fed to the compressor without producing any compressed output. DICT is the Adler-32 checksum of this sequence of bytes (see the definition of ADLER32 below). The decompressor can use this identifier to determine which dictionary has been used by the compressor.
Q3 My third question
Assuming "1" means "is set"
\xde -> 11011_1_10
\xda -> 11011_0_10
According to the specification, DICTID consists of 4 bytes. The next four bytes in the compressed streams I have are
bbd\x10
cbd\x10
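Why is the compressed data from the PDF stream object (with FDICT 1) almost identical to the data compressed with Python's zlib (with FDICT 0)?
Although I don't understand the purpose of DICTID, shouldn't it only be present if FDICT is set?
Q4 My fourth question
Bits 6-7 set FLEVEL (compression level)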
These flags are available for use by specific compression methods. The "deflate" method (CM = 8) sets these flags as follows:
0 - compressor used fastest algorithm
1 - compressor used fast algorithm
2 - compressor used default algorithm
3 - compressor used maximum compression, slowest algorithm
The information in FLEVEL is not needed for decompression; it is there to indicate if recompression might be worthwhile.
I would have expected the flags to be:
0 (00)
1 (01)
2 (10)
3 (11)
But from "What does a zlib header look like?":
01 (00000001) - No Compression/low
[5e (01011110) - Default Compression?]
9c (10011100) - Default Compression
da (11011010) - Best Compression
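I do notice, though, that the two left-most bits seem to match what I expected; I feel I'm clearly missing something basic about how to interpret the bits...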
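One way to see which header a given compression level produces is to ask Python's `zlib` module directly. The sketch below is an experiment, not a specification: `header_for` is a hypothetical helper, the input bytes are arbitrary, and the FLEVEL a compressor writes is an implementation choice, so results may differ with a different deflate implementation.

```python
import zlib


def header_for(level: int, wbits: int = zlib.MAX_WBITS) -> str:
    """Return the two zlib header bytes, as hex, for a level and window size."""
    comp = zlib.compressobj(level=level, wbits=wbits)
    # The header may only be emitted on flush, so concatenate both outputs.
    stream = comp.compress(b"example data") + comp.flush()
    return stream[:2].hex()


for level in (1, 5, 6, 9):
    print(level, header_for(level))

# wbits=14 requests a 16 KiB window, i.e. CINFO = 14 - 8 = 6.
print("level 9, wbits 14:", header_for(9, wbits=14))
```

On a build linked against the reference zlib, I would expect 7801, 785e, 789c and 78da for the four levels, and 68de for the 16 KiB window, which matches the header of the PDF stream above.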
The RFC says:
CMF (Compression Method and flags)
This byte is divided into a 4-bit compression method and a 4-
bit information field depending on the compression method.
bits 0 to 3 CM Compression method
bits 4 to 7 CINFO Compression info
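The least-significant bit of a byte is bit 0. The most-significant bit is bit 7. So your mapping of CM and CINFO to the bits is reversed. 0x78 and 0x68 both have a CM of 8. Their CINFOs are 7 and 6, respectively.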
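As a quick check of that bit numbering, here is a minimal sketch that pulls CM out of the low nibble and CINFO out of the high nibble. `split_cmf` is a hypothetical helper name.

```python
from typing import Tuple


def split_cmf(cmf: int) -> Tuple[int, int]:
    """Return (CM, CINFO) for a CMF byte; bit 0 is the least-significant bit."""
    cm = cmf & 0x0F            # bits 0-3: the low nibble
    cinfo = (cmf >> 4) & 0x0F  # bits 4-7: the high nibble
    return cm, cinfo


for byte in (0x68, 0x78):
    print(hex(byte), split_cmf(byte))
```

This is why the hex digits read "78" rather than "87": the high nibble is printed first, and it holds CINFO.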
CINFO is what the RFC says it is:
CINFO (Compression info)
For CM = 8, CINFO is the base-2 logarithm of the LZ77 window
size, minus eight (CINFO=7 indicates a 32K window size).
So a CINFO of 7 means a 32 KiB window. 6 means 16 KiB. CINFO == 0 does not mean no compression. It means a window size of 256 bytes.
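The arithmetic follows directly from the RFC's definition: CINFO = log2(window size) - 8, so window size = 2^(CINFO + 8). A small sketch, with `window_size` as a hypothetical helper name:

```python
def window_size(cinfo: int) -> int:
    """LZ77 window size in bytes for CM = 8, per RFC 1950."""
    if not 0 <= cinfo <= 7:
        raise ValueError("CINFO above 7 is not allowed for CM = 8")
    return 1 << (cinfo + 8)    # 2 ** (CINFO + 8)


for cinfo in range(8):
    print(cinfo, window_size(cinfo))
```

So every CINFO value from 0 to 7 names a real window size, from 256 bytes up to 32 KiB.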
For the flag byte, you have it backwards again. FDICT is not set. For both of your examples, the compression level is 11, i.e. maximum compression.
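The same decoding can be written out for the FLG byte. This sketch uses the bit positions from the RFC, plus the RFC's rule that CMF*256 + FLG must be a multiple of 31. `decode_flg` and the dictionary keys are hypothetical names chosen for illustration.

```python
def decode_flg(cmf: int, flg: int) -> dict:
    """Decode the FLG byte, counting bit 0 as the least-significant bit."""
    return {
        "fcheck": flg & 0x1F,          # bits 0-4
        "fdict": (flg >> 5) & 0x01,    # bit 5
        "flevel": (flg >> 6) & 0x03,   # bits 6-7
        # FCHECK is valid when the 16-bit value CMF*256 + FLG divides by 31.
        "check_ok": ((cmf << 8) | flg) % 31 == 0,
    }


print(decode_flg(0x68, 0xDE))  # header of the PDF stream
print(decode_flg(0x78, 0xDA))  # header of the re-compressed stream
```

Both headers decode to FDICT = 0 and FLEVEL = 3, so neither stream carries a DICTID, and the bytes that follow the header are already deflate data.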