DEFLATE 只能压缩最多 32 KiB 的重复字符串吗?
Can DEFLATE only compress duplicate strings up to 32 KiB apart?
根据DEFLATE spec:
- Compressed representation overview
A compressed data set consists of a series of blocks, corresponding to successive blocks of input
data. The block sizes are arbitrary, except that non-compressible
blocks are limited to 65,535 bytes.
Each block is compressed using a combination of the LZ77 algorithm and
Huffman coding. The Huffman trees for each block are independent of
those for previous or subsequent blocks; the LZ77 algorithm may use a
reference to a duplicated string occurring in a previous block, up to
32K input bytes before.
Each block consists of two parts: a pair of Huffman code trees that
describe the representation of the compressed data part, and a
compressed data part. (The Huffman trees themselves are compressed
using Huffman encoding.) The compressed data consists of a series of
elements of two types: literal bytes (of strings that have not been
detected as duplicated within the previous 32K input bytes), and
pointers to duplicated strings, where a pointer is represented as a
pair <length, backward distance>. The representation used in the
"deflate" format limits distances to 32K bytes and lengths to 258
bytes, but does not limit the size of a block, except for
uncompressible blocks, which are limited as noted above.
所以指向重复字符串的指针只能返回 32 KiB,但是由于块大小不受限制,霍夫曼代码树是否可以存储两个相隔超过 32 KiB 的重复字符串作为相同的代码?那么限制因素是块大小吗?
距离的霍夫曼树包含代码 0 到 29(下面的table);代码 29,后跟“普通”位中的 8191,表示“距离 32768”。这是 Deflate 定义中的一个硬性限制。块大小没有限制。实际上块大小没有存储在任何地方:块是一个无限流。如果你想停止该块,你为此发送一个块结束代码。
Distance Codes
--------------
Extra Extra Extra Extra
Code Bits Dist Code Bits Dist Code Bits Distance Code Bits Distance
---- ---- ---- ---- ---- ------ ---- ---- -------- ---- ---- --------
0 0 1 8 3 17-24 16 7 257-384 24 11 4097-6144
1 0 2 9 3 25-32 17 7 385-512 25 11 6145-8192
2 0 3 10 4 33-48 18 8 513-768 26 12 8193-12288
3 0 4 11 4 49-64 19 8 769-1024 27 12 12289-16384
4 1 5,6 12 5 65-96 20 9 1025-1536 28 13 16385-24576
5 1 7,8 13 5 97-128 21 9 1537-2048 29 13 24577-32768
6 2 9-12 14 6 129-192 22 10 2049-3072
7 2 13-16 15 6 193-256 23 10 3073-4096
补充一下 Zerte 的回答,对先前序列的引用与块或块边界无关。这样的引用可以在块内,跨块,引用的序列可以跨块边界。
根据DEFLATE spec:
- Compressed representation overview
A compressed data set consists of a series of blocks, corresponding to successive blocks of input data. The block sizes are arbitrary, except that non-compressible blocks are limited to 65,535 bytes.
Each block is compressed using a combination of the LZ77 algorithm and Huffman coding. The Huffman trees for each block are independent of those for previous or subsequent blocks; the LZ77 algorithm may use a reference to a duplicated string occurring in a previous block, up to 32K input bytes before.
Each block consists of two parts: a pair of Huffman code trees that describe the representation of the compressed data part, and a compressed data part. (The Huffman trees themselves are compressed using Huffman encoding.) The compressed data consists of a series of elements of two types: literal bytes (of strings that have not been detected as duplicated within the previous 32K input bytes), and pointers to duplicated strings, where a pointer is represented as a pair <length, backward distance>. The representation used in the "deflate" format limits distances to 32K bytes and lengths to 258 bytes, but does not limit the size of a block, except for uncompressible blocks, which are limited as noted above.
所以指向重复字符串的指针只能返回 32 KiB,但是由于块大小不受限制,霍夫曼代码树是否可以存储两个相隔超过 32 KiB 的重复字符串作为相同的代码?那么限制因素是块大小吗?
距离的霍夫曼树包含代码 0 到 29(下面的table);代码 29,后跟“普通”位中的 8191,表示“距离 32768”。这是 Deflate 定义中的一个硬性限制。块大小没有限制。实际上块大小没有存储在任何地方:块是一个无限流。如果你想停止该块,你为此发送一个块结束代码。
Distance Codes
--------------
Extra Extra Extra Extra
Code Bits Dist Code Bits Dist Code Bits Distance Code Bits Distance
---- ---- ---- ---- ---- ------ ---- ---- -------- ---- ---- --------
0 0 1 8 3 17-24 16 7 257-384 24 11 4097-6144
1 0 2 9 3 25-32 17 7 385-512 25 11 6145-8192
2 0 3 10 4 33-48 18 8 513-768 26 12 8193-12288
3 0 4 11 4 49-64 19 8 769-1024 27 12 12289-16384
4 1 5,6 12 5 65-96 20 9 1025-1536 28 13 16385-24576
5 1 7,8 13 5 97-128 21 9 1537-2048 29 13 24577-32768
6 2 9-12 14 6 129-192 22 10 2049-3072
7 2 13-16 15 6 193-256 23 10 3073-4096
补充一下 Zerte 的回答,对先前序列的引用与块或块边界无关。这样的引用可以在块内,跨块,引用的序列可以跨块边界。