PHP Filter FlateDecode PDF stream returning offset characters

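I have code that uses the filetotext class to extract text from PDFs. It worked until last week, when something changed in the generated PDFs. The odd thing is that the characters are all there and correct once I add 29 to each character's ord value.

Sample debug printout of the response: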

/F1 7.31 Tf
0 0 0 rg
1 0 0 1 195.16 597.4 Tm
($PRXQW)Tj
ET
BT

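The code uses gzuncompress on the stream sections of the PDF. $PRXQW is "Amount"; adding 29 (decimal) to the ord of each character gives that. Sometimes, though, a character does not follow this exact translation: for example, what should be a ) in the text appears as the two bytes 5C 66.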
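To make the observation concrete, here is a minimal sketch in Python (PHP's gzuncompress corresponds to Python's zlib.decompress, which is not needed here). It assumes only what the question reports: a fixed offset of 29 for this one font. The function names are made up for illustration. It also shows a likely reading of the odd 5C 66 case: those two bytes are `\f`, which is the PDF literal-string escape for form feed (0x0C), and 0x0C + 29 = 0x29, which is ")". A fixed offset is a property of this particular font, not a general rule.

```python
import re

# Single-character escapes defined for PDF literal strings.
_ESCAPES = {b"n": 0x0A, b"r": 0x0D, b"t": 0x09, b"b": 0x08, b"f": 0x0C,
            b"(": 0x28, b")": 0x29, b"\\": 0x5C}

def unescape_pdf_literal(raw: bytes) -> bytes:
    """Resolve backslash escapes in the body of a PDF literal string."""
    def repl(match):
        tok = match.group(1)
        if tok in _ESCAPES:
            return bytes([_ESCAPES[tok]])
        if re.fullmatch(rb"[0-7]{1,3}", tok):       # octal escape \ddd
            return bytes([int(tok.decode("ascii"), 8) & 0xFF])
        if tok in (b"\r\n", b"\r", b"\n"):          # line continuation
            return b""
        return tok                                  # unknown escape: drop the backslash
    return re.sub(rb"\\(\r\n|[0-7]{1,3}|[\s\S])", repl, raw)

def shift_decode(raw: bytes, offset: int = 29) -> str:
    """Apply the offset observed in the question (an assumption, font-specific)."""
    return "".join(chr(b + offset) for b in unescape_pdf_literal(raw))

print(shift_decode(b"$PRXQW"))   # Amount
print(shift_decode(rb"\f"))      # )
```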

Just wondering about this code ring type of character coming out of PDF's now and if anyone has seen this kind of thing?

The encoding of the string operand of the Tj operator depends entirely on the PDF font in use (F1 in the example at hand):

A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.

With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".

With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite Fonts".

(section 9.4.3 "Text-Showing Operators" in ISO 32000-1)
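As a small illustration of the difference quoted above, the following Python sketch splits a string operand into character codes. The function name and the fixed `code_length` parameter are assumptions for illustration: one byte per code matches a simple font, and two bytes per code matches a composite font using Identity-H. Other CMaps can define variable code lengths, which this sketch does not handle.

```python
def split_codes(data: bytes, code_length: int = 1) -> list:
    """Split a string operand into fixed-width, big-endian character codes."""
    if len(data) % code_length:
        raise ValueError("string length is not a multiple of the code length")
    return [int.from_bytes(data[i:i + code_length], "big")
            for i in range(0, len(data), code_length)]

# Simple font: every byte is its own character code.
print(split_codes(b"$PRXQW"))                     # [36, 80, 82, 88, 81, 87]

# Composite font with 2-byte codes: the same idea, two bytes at a time.
print(split_codes(bytes.fromhex("003A0044"), 2))  # [58, 68]
```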

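The OP's code seems to assume a standard encoding such as MacRomanEncoding or WinAnsiEncoding, but these are only special cases. As the quote above indicates, the encoding may just as well be some ad-hoc, mixed multi-byte encoding.

A later section of the PDF specification describes how to extract text properly: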

A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value. Tagged PDF documents, in particular, shall provide at least one of these methods (see 14.8.2.4.2, "Unicode Mapping in Tagged PDF"):

  • If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.

  • If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):

    a) Map the character code to a character name according to Table D.1 and the font’s Differences array.

    b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.

  • If the font is a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity–H and Identity–V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection:

    a) Map the character code to a character identifier (CID) according to the font’s CMap.

    b) Obtain the registry and ordering of the character collection used by the font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.

    c) Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry–ordering–UCS2 (for example, Adobe–Japan1–UCS2).

    d) Obtain the CMap with the name constructed in step (c) (available from the ASN Web site; see the Bibliography).

    e) Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.

If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.

(section 9.10.2 "Mapping Character Codes to Unicode Values" in ISO 32000-1)
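The priority order quoted above can be sketched roughly as follows. This is a simplified Python illustration, not the full procedure from the specification: the layout of the `font` dictionary, the name `to_unicode`, and the use of Python's `cp1252` and `mac_roman` codecs as approximate stand-ins for WinAnsiEncoding and MacRomanEncoding are all assumptions. Differences arrays and the predefined-CMap path for composite fonts are left out.

```python
from typing import Optional

# Assumption: Python codecs that only approximate two predefined PDF encodings.
_PREDEFINED = {"WinAnsiEncoding": "cp1252", "MacRomanEncoding": "mac_roman"}

def to_unicode(code: int, font: dict) -> Optional[str]:
    """Map a character code to Unicode, trying the methods in priority order."""
    # 1. A ToUnicode CMap, here already parsed into a {code: text} dict.
    tounicode = font.get("ToUnicode")
    if tounicode is not None and code in tounicode:
        return tounicode[code]
    # 2. A simple font with one of the predefined encodings.
    codec = _PREDEFINED.get(font.get("Encoding"))
    if codec is not None and 0 <= code <= 0xFF:
        try:
            return bytes([code]).decode(codec)
        except UnicodeDecodeError:
            return None
    # 3. Nothing usable: the code cannot be mapped without further information.
    return None

print(to_unicode(0x24, {"ToUnicode": {0x24: "A"}}))      # A
print(to_unicode(0x41, {"Encoding": "WinAnsiEncoding"}))  # A
print(to_unicode(0x41, {"Encoding": "Identity-H"}))       # None
```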

Thus:

Just wondering about this code ring type of character coming out of PDF's now and if anyone has seen this kind of thing?

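Yes, it is fairly common in PDFs for the string operands of text-drawing operators to be encoded in a way that has nothing to do with ASCII. And as the last paragraph of the second quote above implies, there are situations that do not allow text extraction at all (short of OCR), even though there are additional places one can look for a mapping to Unicode.

In most cases, what you need in order to decode the cryptic strings is the /Encoding entry of the selected font, in your case font /F1. The encoding is quite likely /Identity-H. By itself, /Identity-H only passes the 16-bit codes in the PDF string through unchanged as CIDs; the mapping from those codes to UTF-16 characters can be arbitrary and is given by the font's /ToUnicode CMap.

Here is an example from a PDF parser I am writing. Each page contains a resources dictionary, which in turn contains a font dictionary: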

[&3|0] => Array [
   [/Type] => |/Page|
   [/Resources] => Array [
      [/Font] => Array [
         [/F1] => |&5|0|
         [/F2] => |&7|0|
         [/F3] => |&9|0|
         [/F4] => |&14|0|
         [/F5] => |&16|0|
      ]
   ]
   [/Contents] => |&4|0|
]

In my case, /F3 produced unusable text, so let's look at /F3:

[&9|0] => Array [
    [/Type] => |/Font|
    [/Subtype] => |/Type0|
    [/BaseFont] => |/Arial|
    [/Encoding] => |/Identity-H|
    [/DescendantFonts] => |&10|0|
    [/ToUnicode] => |&96|0|
]

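Here you can see that the /Encoding is /Identity-H. The character map for decoding the characters used with /F3 is stored in the stream referenced by /ToUnicode. Below is the relevant text from the stream referenced by '&96|0' (96 0 R); the rest is boilerplate and can be ignored: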

...
beginbfchar
<0003> <0020>
<000F> <002C>
<0015> <0032>
<001B> <0038>
<002C> <0049>
<003A> <0057>
endbfchar
...
beginbfrange
<0044> <0045> <0061>
<0047> <004C> <0064>
<004F> <0053> <006C>
<0055> <0059> <0072>
endbfrange
...
beginbfchar
<005C> <0079>
<00B1> <2013>
<00B6> <2019>
endbfchar
...

The 16-bit pairs between beginbfchar/endbfchar are mappings of individual characters. For example, <0003> (0x0003) maps to <0020> (0x0020), which is the space character.

The 16-bit triplets between beginbfrange/endbfrange are mappings of character ranges. For example, the characters from <0055> (first) to <0059> (last) map to <0072>, <0073>, <0074>, <0075> and <0076> ('r' through 'v' in both UTF-16 and ASCII).
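The two kinds of entries described above can be turned into a lookup table with a few lines of code. The following is a minimal Python sketch, not the parser from this answer: the function names are made up, only a shortened excerpt of the CMap is used, and it assumes 2-byte codes, an even number of hex digits per value, and range destinations that are a single UTF-16 code unit. The array form of bfrange (`<lo> <hi> [<a> <b> ...]`) is not handled.

```python
import re

_HEX = r"<([0-9A-Fa-f]+)>"

def parse_tounicode(cmap_text: str) -> dict:
    """Build {character code: Unicode text} from bfchar and bfrange sections."""
    mapping = {}
    # bfchar: <code> <unicode>, one pair per character.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(_HEX + r"\s*" + _HEX, block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange: <first> <last> <unicode of first>, consecutive from there.
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(_HEX + r"\s*" + _HEX + r"\s*" + _HEX, block):
            start = int(dst, 16)
            for i, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(start + i)
    return mapping

def decode_two_byte(data: bytes, mapping: dict) -> str:
    """Decode a string of 2-byte codes; unmapped codes become U+FFFD."""
    codes = (int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2))
    return "".join(mapping.get(code, "\ufffd") for code in codes)

CMAP_EXCERPT = """
beginbfchar
<0003> <0020>
<003A> <0057>
endbfchar
beginbfrange
<0044> <0045> <0061>
<0047> <004C> <0064>
<0055> <0059> <0072>
endbfrange
beginbfchar
<00B1> <2013>
endbfchar
"""

table = parse_tounicode(CMAP_EXCERPT)
print(decode_two_byte(bytes.fromhex("003A0044005700480055"), table))  # Water
```

Note that most entries in this particular CMap happen to be the code plus 29 (0x0003 to 0x0020, 0x0044 to 0x0061), which matches the offset observed in the question, while entries such as <00B1> to <2013> do not follow it. That is why a table lookup is more robust than a fixed offset.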