如何手动构造一个 gzip 以便压缩文件大于原始文件？

How to manually construct a gzip so that compressed file is larger than original?

假设一个名为 data.bin 的 1KB 文件，如果可以构建它的 gzip data.bin.gz，但要大得多，该怎么做？

我们理论上可以将 GZIP 格式扩大多少？

你可以让它任意大。获取任何 gzip 文件并在五个字节中插入尽可能多的重复：00 00 00 ff ff 在 gzip header 之后和压缩数据之前。

总结：

具有header fields/general结构：除非遇到软件限制，否则效果是无限的
空块：不受格式规范影响
未压缩块：效果限制为 6x
压缩块：以明显的方式，最大效果估计为 1.125x，很难实现

采用 gzip 格式 (RFC1952 (metadata), RFC1951 (deflate format), additional notes for GNU gzip) 并尽情使用它。

Header

有一大堆地方可以利用：

使用可选字段（原始文件名、文件注释、额外字段）
直截了当地附加垃圾（GNU gzip解压时会发出警告）
连接多个 gzip 存档（格式允许，生成的未压缩数据同样是连接或所有块）。
- 一个有趣的副作用（显然是 GNU gzip 中的一个错误）：gzip -l 仅从最后一个块（即使它是垃圾）中获取报告的未压缩大小，而不是将所有值相加。所以你可以使它 看起来 就像存档（荒谬地） larger/smaller 比原始数据。

这些是显而易见的，您也许可以找到其他方法。

数据

“deflate”格式的总体布局是 (RFC1951):

A compressed data set consists of a series of blocks, corresponding to successive blocks of input data. The block sizes are arbitrary, except that non-compressible blocks are limited to 65,535 bytes.
<...>
Each block consists of two parts: a pair of Huffman code trees that describe the representation of the compressed data part, and a compressed data part. (The Huffman trees themselves are compressed using Huffman encoding.) The compressed data consists of a series of elements of two types: literal bytes (of strings that have not been detected as duplicated within the previous 32K input bytes), and pointers to duplicated strings, where a pointer is represented as a pair <length, backward distance>. The representation used in the "deflate" format limits distances to 32K bytes and lengths to 258 bytes, but does not limit the size of a block, except for uncompressible blocks, which are limited as noted above.

满块

00 00 00 ff ff 即 Mark Adler suggests 本质上是一个空的 non-final 块（RFC1951 第 3.2.3 节。对于第一个字节，3.2.4。对于未压缩的块本身） .
_{顺便说一句，根据gzip overview at the official site和源码，马克是解压部分的作者...}

未压缩块

使用 non-empty 个未压缩的块（参考上一节），您最多可以为每个符号创建一个。因此，效果仅限于 6 倍。

压缩块

一言以蔽之：有些inflation是可以实现的，但是难度很大，能实现的效果有限。除非你有充分的理由，否则不要在他们身上浪费你的时间。

在压缩块内（第 3.2.5 节），每个块都是 [<encoded character(8-9 bits>|<encoded chunk length (7-11 bits)><distance back to data(5-18 bits)>]，长度从 3 开始。7-9 位代码明确解析为文字字符或特定长度范围.较长的代码对应较大的 lengths/distances。块之间不允许 space/meaningless 内容。
因此原始字节块的最大值为 9/8 (1.125x) - 如果 all 原始字节的代码为 144 - 255.
玩参考块对你没有任何好处：即使是对 3 字节序列的参考也最多只能给出 25/24 (1.04x)。

这就是静态霍夫曼表。查看有关动态文档的文档，它针对特定数据或其他内容优化了上述编码。因此，它应该允许使给定数据的比率更接近可实现的最大值，仅此而已。

如何手动构造一个 gzip 以便压缩文件大于原始文件？

How to manually construct a gzip so that compressed file is larger than original?

gzip

huffman-code

data-compression

Header

数据

满块

未压缩块

压缩块