Using a preset deflate dictionary to reduce compressed archive file size

Question

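I need to send text files from one location to another. Both locations are under our control. The nature of the content, and the words likely to appear in it, are mostly the same from file to file. This means that if I store the deflate dictionary once at both locations, I do not need to send it along with each file.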

I have been reading about this for the past week and experimenting with some of the available code, such as this and this.

However, I am still in the dark.

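I still have a few questions:

1. Can we generate and use a custom deflate dictionary from a preset list of words?
2. Can we send a file without the deflate dictionary and use a local copy of the dictionary instead?
3. If not gzip, is there any compression library that can be used for this purpose?

Some references I have come across so far: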

Answer 1

The zlib library supports dictionaries with the zlib (not gzip) format. See deflateSetDictionary() and inflateSetDictionary().
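As a rough illustration of what the zlib format does with a preset dictionary, here is a sketch using Python's zlib module, which exposes these calls through its zdict parameter. The dictionary and payload are made up for the example. Per RFC 1950, the FDICT bit (0x20) of the second header byte is set, and the next four bytes carry the Adler-32 of the dictionary, so the receiver can tell which dictionary the stream requires.

```python
import zlib

# Hypothetical shared dictionary and payload, for illustration only.
shared_dict = b"status=ok;status=error;user_id=;timestamp="
payload = b"user_id=42;status=ok;timestamp=1700000000"

# Positive wbits selects the zlib container (2-byte header + Adler-32 trailer).
comp = zlib.compressobj(wbits=zlib.MAX_WBITS, zdict=shared_dict)
stream = comp.compress(payload) + comp.flush()

# RFC 1950: bit 5 (0x20) of the FLG byte, the second byte, is FDICT.
fdict_set = bool(stream[1] & 0x20)
# When FDICT is set, bytes 2..5 hold the dictionary's Adler-32, big-endian.
dict_id = int.from_bytes(stream[2:6], "big")

# The receiver supplies the same dictionary to decompress.
decomp = zlib.decompressobj(wbits=zlib.MAX_WBITS, zdict=shared_dict)
restored = decomp.decompress(stream) + decomp.flush()
```

The gzip header has no comparable dictionary field, which is why the dictionary mechanism is tied to the zlib container here.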

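There is nothing special about how a dictionary is constructed. It is simply up to 32K bytes of strings that you expect to appear often in the data you are compressing. Put the most common strings at the end of the 32K.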
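One way to follow that advice is sketched below. The helper build_dictionary and the (phrase, count) input are hypothetical, not part of zlib; the idea is only to order phrases so the most frequent ones land at the end, where deflate can reference them with the shortest distances, and to keep the tail if the result exceeds 32K.

```python
MAX_DICT_SIZE = 32 * 1024  # deflate can only reference the last 32 KiB


def build_dictionary(phrases_by_frequency):
    """Join phrases so that the most frequent ones end up at the end.

    `phrases_by_frequency` is a hypothetical list of (phrase, count) pairs,
    for example gathered by counting phrases in a sample of past files.
    """
    # Sort ascending by count: rare phrases first, common phrases last.
    ordered = sorted(phrases_by_frequency, key=lambda item: item[1])
    blob = b"".join(phrase for phrase, _ in ordered)
    # If the blob is too long, keep the tail, which holds the common phrases.
    return blob[-MAX_DICT_SIZE:]


sample = [
    (b"Dear customer, ", 900),
    (b"Kind regards", 850),
    (b"invoice number ", 120),
]
zdict = build_dictionary(sample)
```

The resulting bytes can be passed as zdict to zlib.compressobj() and zlib.decompressobj() on both sides.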

Answer 2

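Below are the specific answers I found, along with sample code.

1. Can we generate and use a custom deflate dictionary from a preset list of words?

Yes, this can be done. A simple example in Python: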

import zlib

# Data to compress (also used as the dictionary in this toy example)
hello = b'hello'

# Compress as raw deflate (negative wbits) with a preset dictionary
co = zlib.compressobj(wbits=-zlib.MAX_WBITS, zdict=hello)
compress_data = co.compress(hello) + co.flush()

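2. Can we send a file without the deflate dictionary and use a local copy instead?

Yes, you can send just the data, without the dictionary. In the example above, the compressed data is in compress_data. To decompress it, however, you must pass the same zdict value that was used during compression. Decompression example: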

hello = b'hello'  # must be the same bytes that were passed as zdict when compressing
do = zlib.decompressobj(wbits=-zlib.MAX_WBITS, zdict=hello)
data = do.decompress(compress_data) + do.flush()
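The receiver has to hold exactly the same dictionary bytes as the sender. The sketch below illustrates this with the zlib container format (default wbits) rather than the raw deflate used above, because the container records the dictionary's checksum and a mismatch should be rejected with zlib.error instead of silently producing wrong output. The helper try_decompress and the sample data are made up for the example.

```python
import zlib

good_dict = b"hello world, hello there"
wrong_dict = b"completely different bytes"
message = b"hello there, world"

# Default wbits: zlib container, which stores the dictionary's Adler-32.
comp = zlib.compressobj(zdict=good_dict)
stream = comp.compress(message) + comp.flush()


def try_decompress(stream, zdict=None):
    """Return the decompressed bytes, or None if zlib rejects the stream."""
    try:
        if zdict is None:
            decomp = zlib.decompressobj()
        else:
            decomp = zlib.decompressobj(zdict=zdict)
        return decomp.decompress(stream) + decomp.flush()
    except zlib.error:
        return None


with_good = try_decompress(stream, good_dict)    # same dictionary as the sender
with_wrong = try_decompress(stream, wrong_dict)  # different dictionary
without = try_decompress(stream)                 # no dictionary at all
```

With raw deflate there is no stored checksum, so keeping the two copies of the dictionary in sync (for example by versioning them) is left to the application.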

Complete example code, with and without the dictionary:

import zlib

# Data to compress (also used as the dictionary in this toy example)
hello = b'hello'

# Compression with dictionary
co = zlib.compressobj(wbits=-zlib.MAX_WBITS, zdict=hello)
compress_data = co.compress(hello) + co.flush()

# Compression without dictionary
co_nodict = zlib.compressobj(wbits=-zlib.MAX_WBITS)
compress_data_nodict = co_nodict.compress(hello) + co_nodict.flush()

# Decompression with dictionary
do = zlib.decompressobj(wbits=-zlib.MAX_WBITS, zdict=hello)
data = do.decompress(compress_data) + do.flush()

# Print compressed output when the dictionary is used
print(compress_data)

# Print compressed output when the dictionary is not used
print(compress_data_nodict)

# Print decompressed output when the dictionary is used
print(data)
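With a five-byte input the difference is hard to see. A slightly more realistic sketch, using made-up log-style text and a hypothetical dictionary of phrases expected in every file, compares the compressed sizes directly:

```python
import zlib


def raw_deflate(data, zdict=None):
    """Raw deflate (no header), optionally primed with a preset dictionary."""
    if zdict is None:
        comp = zlib.compressobj(wbits=-zlib.MAX_WBITS)
    else:
        comp = zlib.compressobj(wbits=-zlib.MAX_WBITS, zdict=zdict)
    return comp.compress(data) + comp.flush()


# Hypothetical dictionary: phrases we expect to see in every file.
zdict = (b"ERROR WARNING INFO connection timed out "
         b"request completed in ms user logged in")
message = (b"INFO user logged in\n"
           b"WARNING connection timed out\n"
           b"INFO request completed in 12 ms\n")

size_plain = len(raw_deflate(message))
size_dict = len(raw_deflate(message, zdict))
print(size_plain, size_dict)

# Round trip with the same dictionary on the receiving side.
decomp = zlib.decompressobj(wbits=-zlib.MAX_WBITS, zdict=zdict)
restored = decomp.decompress(raw_deflate(message, zdict)) + decomp.flush()
```

The gain is largest for short messages, where plain deflate has seen too little data to find repeats on its own.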

The code above operates on bytes, so it does not accept Unicode strings (str) directly. For Unicode data, encode the text to bytes first:

import zlib

# Data to compress: encode the Unicode text to bytes first
unicode_data = 'റെക്കോർഡ്'
hello = unicode_data.encode('utf-16be')

# Compression with dictionary
co = zlib.compressobj(wbits=-zlib.MAX_WBITS, zdict=hello)
compress_data = co.compress(hello) + co.flush()
...
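For completeness, a sketch of the full Unicode round trip under the same assumptions: the receiver decompresses with the same dictionary bytes and then decodes with the same codec. The variable names are illustrative, and any codec works (UTF-8 included) as long as both sides, and the dictionary, use the same one.

```python
import zlib

text = 'റെക്കോർഡ്'
encoding = 'utf-16be'  # both sides must agree on the codec
raw = text.encode(encoding)

# Toy setup as above: the data doubles as the dictionary.
comp = zlib.compressobj(wbits=-zlib.MAX_WBITS, zdict=raw)
compressed = comp.compress(raw) + comp.flush()

# Receiver: same dictionary bytes, then decode back to str.
decomp = zlib.decompressobj(wbits=-zlib.MAX_WBITS, zdict=raw)
restored = decomp.decompress(compressed) + decomp.flush()
restored_text = restored.decode(encoding)
```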

Reference for a JS-based approach:

How to find a good/optimal dictionary for zlib 'setDictionary' when processing a given set of data?
