Unrecognised glyphs in PDF (summationdisplay, summationtext)

Question

I'm trying to process a PDF with the pdf-reader gem. It mostly works, but wherever there should be a summation sign I get \u0001 instead of \u2211. The relevant font object is:

{:Type=>:Font,
 :Subtype=>:Type1,
 :FirstChar=>1,
 :LastChar=>2,
 :Widths=>[1444, 1056],
 :Encoding=>{:Type=>:Encoding, :Differences=>[1, :summationdisplay, :summationtext]},
 :BaseFont=>:"APHKGN+CMEX10",
 :FontDescriptor=>
  {:Type=>:FontDescriptor,
   :Ascent=>0,
   :CapHeight=>0,
   :Descent=>0,
   :Flags=>4,
   :FontBBox=>[0, -1400, 1387, 0],
   :FontName=>:"APHKGN+CMEX10",
   :ItalicAngle=>0,
   :StemV=>47,
   :StemH=>47,
   :CharSet=>"/summationdisplay/summationtext",
   :FontFile3=>
    #<PDF::Reader::Stream:0x007faab138a528
     @data=
      "H\x89bd`ab`dd\xE4s\f\xF0\xF0v\xF7\xD3v\xF6u\x8D04\x00\x89(\xFD\x90e\xFC!\xCE\xF2C\x8EG\xACX\xE6K\x81\f\xEB\xBA\x9F3X\xBF;\xF1\x7Fw\x13\xF8\xEE%\xB8\xE2\x87\xA7\x10\x03\vP\x9F\rfqinnbIf~^IjE\t\x9C\x93\x92Y\\x90\x93X\xE9\x9C_PY\x94\x99\x9EQ\xA2\xA0\xE1\xAC\xA9`hii\xAE\xE0\x98\x9BZ\x94\x99\x9C\x98\xA7\xE0\x9BX\x92\x91\nR\x9D\x9C\x98\xA3\x10\x9C\x9F\x9C\x99ZR\xA9\xA7\xE0\x98\x93\xA3\x10\x04\xD2Q\xAC\x10\x94Z\x9CZT\x96\x9A\x02u\x15\xD0Y\xED\x8C\fL\x01\x11\f\xCC\x8C\x8C\xECE?\xFF3\xFA\x86\x86\xF1\xFDg\x91\xEFO\xF8Ws\xE8\x97\xECf\xC6\x1F\xD5\x7Ff\x88N\x9A\xD2\xDB\xD7/\xD5\xDF\xD5\xD3:E\xEE\xF7\xCD\x1FA\xAC?\x14\xD8\xBE\xB3}\xAFj\xF9\xED\x7FQ~\t\x9B\xE9\xF7:\xD6\xBF\x17\xD9\n\xBA\xBAr\xE4\x7F0\xFE\xE9\xFA\xFD\xFD\x8F7kscWg\xBBT\xC3\x94\xEE\xB9r?/\xB2=\xFC\xDE\xCBZ\xC4V\xE4\xE0\xE1g\x96\xC7\xD1V\xEDV\xFC[]\xFA\x8F-\e\xDF\x7F\xD6%\x85'd~u<\x92a\xF9\xB8\x9BQ\x86\xE5\x13\x90-\xFA\x9D\xF7\xFB\x15\xA0\xEA\x14eE\xF7\xDF\xEC\xB9\x1Cme\x9A\x85\xBFC\xA4\xFF\xBCg\xFB1\xF1\xC7K\xD6I\x93{\xFB&H\xF5v\xF7\xB5L\x95\xFB\x93\xF6S\x90\xF5\xC7\x0E\xB6\xEFR\xCFj;\xA7\xC8\x1Fl~Tu+rI\xF5\xF9\xB8\xB5V\x1CK\xD8~\xF3~_\xCB*\xF3;\x89\xAD\xA4\xAB\xAB\xB5C\xBE\xAB\xA3\xBB\xA2A\xEA\xC7\xD2\xBF\x19\x7Ff\xFD\xF9\xCC\xDAX\xDF\xDD\xD6\x05q _\xF9|6\x99\xDF\x95\xF3\xD9\xE5\x16\xB8O\x9D9\xE3?\x0F\xE7.\xAE]\xDC\x9B'\xF1\xF0\x001/@\x80\x01\x00J\xBC\xBFN\n",
     @hash={:Filter=>:FlateDecode, :Length=>464, :Subtype=>:Type1C},
     @udata=nil>}}

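Adobe's glyphlist.txt (copied at pdf-reader/lib/pdf/reader/glyphlist.txt) contains only summation, not summationtext or summationdisplay. As a result, the @differences are not applied to the @mapping in PDF::Reader::Encoding#differences=, and @state.current_font.to_utf8(1) fails to find the right glyph (it returns the glyph code as a fallback, which is why I end up with \u0001). In other words, the encoding differences in the PDF font object should, as I understand it, refer by name to glyphs in the master glyph list, but the two don't match up.

What am I missing? If summationdisplay and summationtext are not in Adobe's glyphlist.txt, how do other PDF readers render this font correctly?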
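To make the fallback concrete, here is a minimal, self-contained sketch of the lookup as I understand it. It does not call pdf-reader's internals: `GLYPH_LIST`, `expand_differences` and `to_utf8` are hypothetical stand-ins for glyphlist.txt and the gem's encoding logic, and the sketch assumes the Differences array starts with an integer.

```ruby
# Hypothetical stand-in for Adobe's glyphlist.txt: glyph name => Unicode code point.
# Only :summation is present, mirroring the situation described above.
GLYPH_LIST = { summation: 0x2211 }.freeze

# Expand a PDF /Differences array into { character code => glyph name }.
# An integer sets the current code; each name that follows is assigned the
# current code, which is then incremented.
def expand_differences(differences)
  mapping = {}
  code = nil
  differences.each do |item|
    if item.is_a?(Integer)
      code = item
    else
      mapping[code] = item
      code += 1
    end
  end
  mapping
end

# Convert a character code to a UTF-8 string. When the glyph name is not in
# the glyph list, fall back to the raw character code.
def to_utf8(code, code_to_name, glyph_list = GLYPH_LIST)
  name = code_to_name[code]
  codepoint = glyph_list.fetch(name, code)
  [codepoint].pack("U")
end

code_to_name = expand_differences([1, :summationdisplay, :summationtext])
p to_utf8(1, code_to_name)        # the unknown name falls back to code 1
p to_utf8(1, { 1 => :summation }) # a name in the list resolves to U+2211
```

With the font above, code 1 resolves to the name summationdisplay, the name lookup misses, and the raw code comes back as the text.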

Answer 1

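This defines a font subset with a custom encoding and non-standard glyph names. In addition, it does not include a ToUnicode reverse mapping for the custom encoding.

The PDF 32000 specification covers this scenario: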

9.10 Extraction of Text Content

9.10.1 General

...

When extracting character content, a conforming reader can easily convert text to Unicode values if a font’s characters are identified according to a standard character set that is known to the conforming reader. This character identification can occur if either the font uses a standard named encoding or the characters in the font are identified by standard character names or CIDs in a well-known collection. 9.10.2, "Mapping Character Codes to Unicode Values", describes in detail the overall algorithm for mapping character codes to Unicode values.

If a font is not defined in one of these ways, the glyphs can still be shown, but the characters cannot be converted to Unicode values without additional information:

• This information can be provided as an optional ToUnicode entry in the font dictionary (PDF 1.2; see 9.10.3, "ToUnicode CMaps"), whose value shall be a stream object containing a special kind of CMap file that maps character codes to Unicode values.

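pdf-reader does appear to conform to the above. The font uses a custom subset encoding in which character code 1 is mapped to /summationdisplay, which is why you get \u0001. That is enough information to render the glyphs, but not enough to map the text back to Unicode.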
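If the PDF cannot be regenerated with a ToUnicode CMap, one possible workaround is to supply the missing names yourself. The sketch below is an assumption, not something pdf-reader provides: `EXTRA_GLYPHS`, `unicode_table` and `repair` are hypothetical names, and mapping both glyph names to U+2211 is my own judgment that they are display-size and text-size variants of the summation sign.

```ruby
# Hypothetical supplement to glyphlist.txt for the non-standard names in this font.
# Assumption: both names are size variants of the summation sign (U+2211).
EXTRA_GLYPHS = {
  summationdisplay: 0x2211,
  summationtext: 0x2211
}.freeze

# Build { character code => UTF-8 string } from a font dictionary shaped like
# the one in the question. Names missing from `extra` are skipped.
def unicode_table(font, extra = EXTRA_GLYPHS)
  encoding = font[:Encoding]
  differences = encoding.is_a?(Hash) ? encoding.fetch(:Differences, []) : []
  table = {}
  code = nil
  differences.each do |item|
    if item.is_a?(Integer)
      code = item
    elsif code
      table[code] = [extra[item]].pack("U") if extra.key?(item)
      code += 1
    end
  end
  table
end

# Replace fallback characters (raw character codes) with their Unicode values.
def repair(text, table)
  text.each_char.map { |ch| table.fetch(ch.ord, ch) }.join
end

font = {
  Type: :Font,
  Subtype: :Type1,
  Encoding: { Type: :Encoding, Differences: [1, :summationdisplay, :summationtext] }
}

table = unicode_table(font)
puts repair("\u0001 x_i", table)
```

The repair is only valid for text that was shown with this particular font, since code 1 may mean something else in another font on the same page.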

