我是否正确打包字节以进行 DEFLATE 压缩?
Am I packing bytes correctly for DEFLATE compression?
所以,我知道有很多库可用于进行 DEFLATE 压缩。如果我正在开发生产产品,我会使用 zlib 之类的东西。但作为一种爱好,我正在自己实施它以尝试弄清楚。因此,经过几周的编码、重新编码和调整,我终于能够在合理的时间范围内产生一些我认为合适的输出。但是,如果我尝试 post 我的输出到其中一个在线工具中,我会收到不一定能帮助我找出输出问题的错误。当我让我的程序生成一个实际的位串并手动解析它时,一切似乎都符合 DEFLATE 标准,我可以重建我的数据。这让我相信我的编码是正确的,但我可能完全误解了打包字节时不同的位顺序。以下是我的输出的 Base64 编码版本,然后是我的程序生成的 8 位字节列表。如果有人可以帮助我指出数据失败的地方,将不胜感激。
Defective program output (both Base64 and raw bytes):
Base64 Encoded Output:
ZYQhAQAADMKqQBWagELQXz/AzTQX+eAB
Byte List:
01100101
10000100
00100001
00000001
00000000
00000000
00001100
11000010
10101010
01000000
00010101
10011010
10000000
01000010
11010000
01011111
00111111
11000000
11001101
00110100
00010111
11111001
11100000
00000001
作为我对文档的理解的概述。该标准表示一个块以 1 位开始,以声明它是否是最后一个块,然后是 2 位以声明使用哪种压缩类型,然后是 5 位 hlit、5 位 hdist、4 位 hclen,然后是 hclen + 4 组 3 位,每组给出代码长度对于用于输出 literal/length 代码的代码长度以及距离代码的霍夫曼代码。在此之后是 hlit+257+hdist+1 代码长度的霍夫曼编码字符串,最后是实际压缩数据的霍夫曼编码字符串,由块代码结尾封顶。有趣的是,霍夫曼代码本身是按相反顺序打包的……然而我感到困惑的地方是一些长度代码(代码 16、17、18)之后的 "extra bits" 以及更高的长度和距离代码。那些以与霍夫曼代码相同的相反顺序打包还是被视为 "data other than huffman code"?
Looking at first byte in list (byte 0):
*01 = last block bit
*02 = 2bit compression type (10 = dynamic huffman)
*03 = msb of hlit (#of literal/length codes - 257)
*03 *02 *01
v v v
+-------------------------------+
| 0 1 1 0 0 1 0 1 |
+-------------------------------+
| Byte 0 |
Looking at bytes 8 and 9 (starting with byte 0):
*01 = last bit of hclen + 4 sets of codelen code lengths
*02 = msb of huffman code "10" ("10" = codelen code 18 - repeat 0 11-138 times)
*03 = lsb of 7 "extra bits" for codelen code 18
*04 = msb of 7 "extra bits" for codelen code 18
*03 *02 *01 *04
v v v v
+---------------------------------+---------------------------------+
| 1 0 1 0 1 0 1 0 | 0 1 0 0 0 0 0 0 |
+---------------------------------+---------------------------------+
| Byte 8 | Byte 9 |
这是我的程序的一些额外输出,其中包含实际使用的哈夫曼代码:
--------------------------------------------------------------------------
Literal/Length Bit Codes: Block: 0 hlit: (269 - 257) = 12
--------------------------------------------------------------------------
Code: 32 Count: 1 BitCode: 000 Bit Length: 3
Code: 33 Count: 1 BitCode: 001 Bit Length: 3
Code: 66 Count: 1 BitCode: 010 Bit Length: 3
Code: 97 Count: 1 BitCode: 011 Bit Length: 3
Code: 98 Count: 1 BitCode: 100 Bit Length: 3
Code: 104 Count: 1 BitCode: 101 Bit Length: 3
Code: 108 Count: 1 BitCode: 110 Bit Length: 3
Code: 256 Count: 1 BitCode: 1110 Bit Length: 4
Code: 268 Count: 1 BitCode: 1111 Bit Length: 4
--------------------------------------------------------------------------
Distance Bit Codes: Block: 0 hdist: (5 - 1) = 4
--------------------------------------------------------------------------
Code: 4 Count: 1 BitCode: 00 Bit Length: 2
--------------------------------------------------------------------------
CodeLength Bit Codes: Block: 0 hclen: (16 - 4) = 12
--------------------------------------------------------------------------
Code: 2 Count: 1 BitCode: 110 Bit Length: 3
Code: 3 Count: 7 BitCode: 00 Bit Length: 2
Code: 4 Count: 2 BitCode: 111 Bit Length: 3
Code: 17 Count: 4 BitCode: 01 Bit Length: 2
Code: 18 Count: 5 BitCode: 10 Bit Length: 2
--------------------------------------------------------------------------
--------------------------------------------------------------------------
多余的位没有反转。
您的问题是长度为 2 的单个距离代码是不允许的。单个距离代码的长度必须为 1。来自 RFC 1951:
If only one distance code is used, it is encoded using one bit, not
zero bits; in this case there is a single code length of one, with one
unused code.
所以,我知道有很多库可用于进行 DEFLATE 压缩。如果我正在开发生产产品,我会使用 zlib 之类的东西。但作为一种爱好,我正在自己实施它以尝试弄清楚。因此,经过几周的编码、重新编码和调整,我终于能够在合理的时间范围内产生一些我认为合适的输出。但是,如果我尝试 post 我的输出到其中一个在线工具中,我会收到不一定能帮助我找出输出问题的错误。当我让我的程序生成一个实际的位串并手动解析它时,一切似乎都符合 DEFLATE 标准,我可以重建我的数据。这让我相信我的编码是正确的,但我可能完全误解了打包字节时不同的位顺序。以下是我的输出的 Base64 编码版本,然后是我的程序生成的 8 位字节列表。如果有人可以帮助我指出数据失败的地方,将不胜感激。
Defective program output (both Base64 and raw bytes):
Base64 Encoded Output:
ZYQhAQAADMKqQBWagELQXz/AzTQX+eAB
Byte List:
01100101
10000100
00100001
00000001
00000000
00000000
00001100
11000010
10101010
01000000
00010101
10011010
10000000
01000010
11010000
01011111
00111111
11000000
11001101
00110100
00010111
11111001
11100000
00000001
作为我对文档的理解的概述。该标准表示一个块以 1 位开始,以声明它是否是最后一个块,然后是 2 位以声明使用哪种压缩类型,然后是 5 位 hlit、5 位 hdist、4 位 hclen,然后是 hclen + 4 组 3 位,每组给出代码长度对于用于输出 literal/length 代码的代码长度以及距离代码的霍夫曼代码。在此之后是 hlit+257+hdist+1 代码长度的霍夫曼编码字符串,最后是实际压缩数据的霍夫曼编码字符串,由块代码结尾封顶。有趣的是,霍夫曼代码本身是按相反顺序打包的……然而我感到困惑的地方是一些长度代码(代码 16、17、18)之后的 "extra bits" 以及更高的长度和距离代码。那些以与霍夫曼代码相同的相反顺序打包还是被视为 "data other than huffman code"?
Looking at first byte in list (byte 0):
*01 = last block bit
*02 = 2bit compression type (10 = dynamic huffman)
*03 = msb of hlit (#of literal/length codes - 257)
*03 *02 *01
v v v
+-------------------------------+
| 0 1 1 0 0 1 0 1 |
+-------------------------------+
| Byte 0 |
Looking at bytes 8 and 9 (starting with byte 0):
*01 = last bit of hclen + 4 sets of codelen code lengths
*02 = msb of huffman code "10" ("10" = codelen code 18 - repeat 0 11-138 times)
*03 = lsb of 7 "extra bits" for codelen code 18
*04 = msb of 7 "extra bits" for codelen code 18
*03 *02 *01 *04
v v v v
+---------------------------------+---------------------------------+
| 1 0 1 0 1 0 1 0 | 0 1 0 0 0 0 0 0 |
+---------------------------------+---------------------------------+
| Byte 8 | Byte 9 |
这是我的程序的一些额外输出,其中包含实际使用的哈夫曼代码:
--------------------------------------------------------------------------
Literal/Length Bit Codes: Block: 0 hlit: (269 - 257) = 12
--------------------------------------------------------------------------
Code: 32 Count: 1 BitCode: 000 Bit Length: 3
Code: 33 Count: 1 BitCode: 001 Bit Length: 3
Code: 66 Count: 1 BitCode: 010 Bit Length: 3
Code: 97 Count: 1 BitCode: 011 Bit Length: 3
Code: 98 Count: 1 BitCode: 100 Bit Length: 3
Code: 104 Count: 1 BitCode: 101 Bit Length: 3
Code: 108 Count: 1 BitCode: 110 Bit Length: 3
Code: 256 Count: 1 BitCode: 1110 Bit Length: 4
Code: 268 Count: 1 BitCode: 1111 Bit Length: 4
--------------------------------------------------------------------------
Distance Bit Codes: Block: 0 hdist: (5 - 1) = 4
--------------------------------------------------------------------------
Code: 4 Count: 1 BitCode: 00 Bit Length: 2
--------------------------------------------------------------------------
CodeLength Bit Codes: Block: 0 hclen: (16 - 4) = 12
--------------------------------------------------------------------------
Code: 2 Count: 1 BitCode: 110 Bit Length: 3
Code: 3 Count: 7 BitCode: 00 Bit Length: 2
Code: 4 Count: 2 BitCode: 111 Bit Length: 3
Code: 17 Count: 4 BitCode: 01 Bit Length: 2
Code: 18 Count: 5 BitCode: 10 Bit Length: 2
--------------------------------------------------------------------------
--------------------------------------------------------------------------
多余的位没有反转。
您的问题是长度为 2 的单个距离代码是不允许的。单个距离代码的长度必须为 1。来自 RFC 1951:
If only one distance code is used, it is encoded using one bit, not zero bits; in this case there is a single code length of one, with one unused code.