zlib.error Error -3 while decompressing: incorrect header check

Question

I have a collection of .pdf files for which pdf-parser.py reports: FlateDecode decompress failed. zlib.error Error -3 while decompressing: incorrect header check. See below.

    PDF Comment %PDF-1.4 
    PDF Comment %âãÏÓ
    obj 1 0
     Type: /ExtGState
     Referencing: 
    <<
    /Type/ExtGState
    /SA false
    /SM 0.02
    >>
      <<
        /Type /ExtGState
        /SA false
        /SM 0.02
      >>
    <<
    /Type/ExtGState
    /SA false
    /SM 0.02
    >>
    obj 2 0
     Type: 
     Referencing: 
    [/DeviceRGB]
    [/DeviceRGB]
    obj 3 0
     Type: 
     Referencing: 
     Contains stream
      <<
        /Filter /FlateDecode
        /Length 1136
      >>
     FlateDecode decompress failed. zlib.error Error -3 while decompressing: incorrect header check
    ...
    ...
    <<
    /Producer (tx_pdf 15.0.130.501)
    /CreationDate (D:20100309081052Z)
    >>

The ZLIB header (as defined in RFC 1950) should be one of:

 CMF |  FLG
0x78 | 0x01 - No Compression/low
0x78 | 0x9C - Default Compression
0x78 | 0xDA - Best Compression
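
The two header bytes can also be checked mechanically. The following is a minimal sketch that assumes only the rules stated in RFC 1950: the low nibble of CMF is 8 for deflate, the high nibble is at most 7, and CMF * 256 + FLG is a multiple of 31. The function name is made up for illustration.

```python
def is_valid_zlib_header(cmf: int, flg: int) -> bool:
    """Return True if (cmf, flg) form a valid RFC 1950 header for deflate."""
    uses_deflate = (cmf & 0x0F) == 8          # CM field: 8 means deflate
    window_ok = (cmf >> 4) <= 7               # CINFO: window size up to 32 KiB
    check_ok = (cmf * 256 + flg) % 31 == 0    # FCHECK makes the pair a multiple of 31
    return uses_deflate and window_ok and check_ok


# The three expected headers pass; the pair found in the file does not.
for flg in (0x01, 0x9C, 0xDA, 0xC3):
    print(f"0x78 0x{flg:02X}:", is_valid_zlib_header(0x78, flg))
```

So 0x78 0xC3 is not an unusual compression level; it is simply not a valid ZLIB header.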

When I inspect the file in 010 Editor, however, the header bytes are 0x78 and 0xC3.

Does anyone know what kind of compression these bytes might indicate? I have tried googling the producer (/Producer (tx_pdf 15.0.130.501)) without success.

Answer 1

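Apparently some software processed your PDF as plain text in some ANSI encoding and then wrote the "text" back as UTF-8. This of course garbles every binary section, such as Flate-encoded compressed streams.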
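
One way to spot this kind of damage is to test whether the whole file happens to be valid UTF-8, which a PDF with intact binary streams almost never is. A rough sketch; the function name is hypothetical and the check is only a heuristic:

```python
def looks_utf8_reencoded(data: bytes) -> bool:
    """Heuristic: True if data is valid UTF-8 and contains non-ASCII characters."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False  # raw binary data, probably not re-encoded
    return any(ord(ch) > 0x7F for ch in text)


# A typical PDF start: header plus the binary marker comment.
intact = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"
# Simulate the damage: read as windows-1252 text, write back as UTF-8.
mangled = intact.decode("cp1252").encode("utf-8")

print(looks_utf8_reencoded(intact))   # False
print(looks_utf8_reencoded(mangled))  # True
```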

In your case the ZLIB header has been corrupted into:

78 C3 9A ...

If you decode these bytes as UTF-8, you get the characters

xÚ...

If you then encode these characters in a matching Windows ANSI encoding, you get

78 DA ...

which is the ZLIB header for best compression.
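
The same round trip can be reproduced in a few lines of Python. This sketch uses `cp1252`, Python's name for windows-1252, as the assumed ANSI encoding:

```python
damaged = bytes.fromhex("78c39a")    # bytes found at the start of the stream
text = damaged.decode("utf-8")       # 'x' followed by U+00DA
restored = text.encode("cp1252")     # back to single bytes

print(repr(text), restored.hex())

# Going the other way reproduces the damage.
assert restored.decode("cp1252").encode("utf-8") == damaged
```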

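Thus, you should try to undo this "text encoding change". The question remains, of course, exactly which encoding to use, because there are many very similar encodings, some of which differ in only a single character.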

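Another question is whether the software in question did more than change the text encoding. Software that handles a byte stream as text sometimes makes other changes too, such as normalizing line endings to the local platform standard, or interpreting or dropping control characters. Such additional changes are likely to damage binary data beyond repair.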

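After some trial and error with the provided file, it turned out that the ANSI'ish encoding here is the one Microsoft's .NET Encoding class calls windows-1252, and fortunately the program in question does not appear to have damaged the data in any other way.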

Thus, using these few lines

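using System.IO;
using System.Text;
// Undo the re-encoding: interpret the bytes as UTF-8 text and write them back as windows-1252.
byte[] bytes = File.ReadAllBytes(@"rec1254.pdf");
byte[] converted = Encoding.Convert(Encoding.UTF8, Encoding.GetEncoding("windows-1252"), bytes);
File.WriteAllBytes(@"rec1254-utf8-to-windows-1252.pdf", converted);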

I could repair your sample file. Doing the re-encoding in Python should be just as easy.

Answer 2

Your PDF is most likely encrypted. Adobe Reader may be able to open it with an empty password, but the content is still encrypted.

According to the PDF 1.7 spec, section 7.6 Encryption:

A document can be encrypted to protect its contents from unauthorized access. Encryption applies to all strings and streams in the document's PDF file, with the following exceptions:

The values for the ID entry in the trailer

Any strings in an Encrypt dictionary

Any strings that are inside streams such as content streams and compressed object streams, which themselves are encrypted

This means you need to decrypt your streams before inflating them.
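
A crude way to test this hypothesis is to look for an /Encrypt key, which an encrypted PDF carries in its trailer dictionary. The helper below is a hypothetical sketch that does no real PDF parsing, so it can report a false positive if the same byte sequence occurs elsewhere in the file:

```python
def mentions_encrypt(pdf_bytes: bytes) -> bool:
    """Rough check: does the raw file contain an /Encrypt key anywhere?"""
    return b"/Encrypt" in pdf_bytes


# Two made-up trailer fragments for illustration.
plain_trailer = b"trailer\n<< /Size 12 /Root 1 0 R >>\n"
encrypted_trailer = b"trailer\n<< /Size 12 /Root 1 0 R /Encrypt 11 0 R >>\n"

print(mentions_encrypt(plain_trailer))      # False
print(mentions_encrypt(encrypted_trailer))  # True
```

If the key is absent, encryption is unlikely to be the cause of the zlib error.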