Unrecognised glyphs in PDF (summationdisplay, summationtext)
I'm trying to process a PDF with the pdf-reader gem. It mostly works, but where there should be a summation symbol I get \u0001 instead of \u2211. The relevant font object is:
{:Type=>:Font,
:Subtype=>:Type1,
:FirstChar=>1,
:LastChar=>2,
:Widths=>[1444, 1056],
:Encoding=>{:Type=>:Encoding, :Differences=>[1, :summationdisplay, :summationtext]},
:BaseFont=>:"APHKGN+CMEX10",
:FontDescriptor=>
{:Type=>:FontDescriptor,
:Ascent=>0,
:CapHeight=>0,
:Descent=>0,
:Flags=>4,
:FontBBox=>[0, -1400, 1387, 0],
:FontName=>:"APHKGN+CMEX10",
:ItalicAngle=>0,
:StemV=>47,
:StemH=>47,
:CharSet=>"/summationdisplay/summationtext",
:FontFile3=>
#<PDF::Reader::Stream:0x007faab138a528
@data=
"H\x89bd`ab`dd\xE4s\f\xF0\xF0v\xF7\xD3v\xF6u\x8D04\x00\x89(\xFD\x90e\xFC!\xCE\xF2C\x8EG\xACX\xE6K\x81\f\xEB\xBA\x9F3X\xBF;\xF1\x7Fw\x13\xF8\xEE%\xB8\xE2\x87\xA7\x10\x03\vP\x9F\rfqinnbIf~^IjE\t\x9C\x93\x92Y\\x90\x93X\xE9\x9C_PY\x94\x99\x9EQ\xA2\xA0\xE1\xAC\xA9`hii\xAE\xE0\x98\x9BZ\x94\x99\x9C\x98\xA7\xE0\x9BX\x92\x91\nR\x9D\x9C\x98\xA3\x10\x9C\x9F\x9C\x99ZR\xA9\xA7\xE0\x98\x93\xA3\x10\x04\xD2Q\xAC\x10\x94Z\x9CZT\x96\x9A\x02u\x15\xD0Y\xED\x8C\fL\x01\x11\f\xCC\x8C\x8C\xECE?\xFF3\xFA\x86\x86\xF1\xFDg\x91\xEFO\xF8Ws\xE8\x97\xECf\xC6\x1F\xD5\x7Ff\x88N\x9A\xD2\xDB\xD7/\xD5\xDF\xD5\xD3:E\xEE\xF7\xCD\x1FA\xAC?\x14\xD8\xBE\xB3}\xAFj\xF9\xED\x7FQ~\t\x9B\xE9\xF7:\xD6\xBF\x17\xD9\n\xBA\xBAr\xE4\x7F0\xFE\xE9\xFA\xFD\xFD\x8F7kscWg\xBBT\xC3\x94\xEE\xB9r?/\xB2=\xFC\xDE\xCBZ\xC4V\xE4\xE0\xE1g\x96\xC7\xD1V\xEDV\xFC[]\xFA\x8F-\e\xDF\x7F\xD6%\x85'd~u<\x92a\xF9\xB8\x9BQ\x86\xE5\x13\x90-\xFA\x9D\xF7\xFB\x15\xA0\xEA\x14eE\xF7\xDF\xEC\xB9\x1Cme\x9A\x85\xBFC\xA4\xFF\xBCg\xFB1\xF1\xC7K\xD6I\x93{\xFB&H\xF5v\xF7\xB5L\x95\xFB\x93\xF6S\x90\xF5\xC7\x0E\xB6\xEFR\xCFj;\xA7\xC8\x1Fl~Tu+rI\xF5\xF9\xB8\xB5V\x1CK\xD8~\xF3~_\xCB*\xF3;\x89\xAD\xA4\xAB\xAB\xB5C\xBE\xAB\xA3\xBB\xA2A\xEA\xC7\xD2\xBF\x19\x7Ff\xFD\xF9\xCC\xDAX\xDF\xDD\xD6\x05q _\xF9|6\x99\xDF\x95\xF3\xD9\xE5\x16\xB8O\x9D9\xE3?\x0F\xE7.\xAE]\xDC\x9B'\xF1\xF0\x001/@\x80\x01\x00J\xBC\xBFN\n",
@hash={:Filter=>:FlateDecode, :Length=>464, :Subtype=>:Type1C},
@udata=nil>}}
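Since Adobe's glyphlist.txt (copied at pdf-reader/lib/pdf/reader/glyphlist.txt) contains only summation, and neither summationtext nor summationdisplay, the @differences are not applied to the @mapping in PDF::Reader::Encoding#differences=, and @state.current_font.to_utf8(1) fails to get the right glyph (it returns the glyph code as a fallback, which is why I end up with \u0001). In other words, the font mapping differences in the PDF font object should (as I understand it) refer to glyphs in the master glyph list by name, but these two names have no match there.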
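The behaviour described above can be reproduced in a few lines of plain Ruby. This is only a sketch of the idea, not pdf-reader's actual implementation: GLYPH_LIST is a hypothetical, heavily trimmed stand-in for glyphlist.txt, and apply_differences and to_utf8 are illustrative helper names.

```ruby
# Hypothetical, heavily trimmed stand-in for glyphlist.txt (name => code point).
GLYPH_LIST = { summation: 0x2211, A: 0x0041 }.freeze

# Walk a /Differences array: an Integer resets the current character code,
# and each following glyph name is assigned to successive codes.
def apply_differences(differences, glyph_list = GLYPH_LIST)
  mapping = {}
  code = 0
  differences.each do |entry|
    if entry.is_a?(Integer)
      code = entry
    else
      code_point = glyph_list[entry]
      mapping[code] = code_point if code_point # unknown names are skipped
      code += 1
    end
  end
  mapping
end

# Fall back to the raw character code when no mapping exists.
def to_utf8(mapping, code)
  [mapping.fetch(code, code)].pack("U*")
end

mapping = apply_differences([1, :summationdisplay, :summationtext])
p mapping             # => {}
p to_utf8(mapping, 1) # => "\u0001"
```

Neither name is found in the list, so the mapping stays empty and character code 1 falls through unchanged.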
What am I missing? If summationdisplay and summationtext are not in Adobe's glyphlist.txt, how do other PDF readers render this font correctly?
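This is a font subset defined with a custom encoding and non-standard glyph names. In addition, it does not include a ToUnicode reverse mapping from the custom encoding.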
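Both conditions can be checked directly on a font dictionary. The sketch below works on a plain Ruby Hash shaped like the dump in the question; extraction_report and KNOWN_GLYPH_NAMES are hypothetical names, and the name list is a tiny stand-in for the real glyph list rather than anything pdf-reader exposes.

```ruby
# Hypothetical, heavily trimmed stand-in for the names in glyphlist.txt.
KNOWN_GLYPH_NAMES = %i[summation A B].freeze

# Report whether a font dictionary (a plain Hash) carries a /ToUnicode entry
# and which /Differences glyph names are missing from the known list.
def extraction_report(font, known_names = KNOWN_GLYPH_NAMES)
  encoding = font[:Encoding]
  differences = encoding.is_a?(Hash) ? encoding.fetch(:Differences, []) : []
  names = differences.select { |entry| entry.is_a?(Symbol) }
  {
    has_to_unicode: font.key?(:ToUnicode),
    unknown_names: names - known_names
  }
end

# Abbreviated copy of the font object from the question.
font = {
  Type: :Font, Subtype: :Type1, FirstChar: 1, LastChar: 2,
  Encoding: { Type: :Encoding, Differences: [1, :summationdisplay, :summationtext] },
  BaseFont: :"APHKGN+CMEX10"
}
p extraction_report(font)
```

For this font the report shows no ToUnicode entry and lists both glyph names as unknown, which is the combination that makes text extraction fail.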
The PDF 32000 specification covers this scenario:
9.10 Extraction of Text Content
9.10.1 General
...
When extracting character content, a conforming reader can easily convert text to Unicode values if a font’s characters are identified according to a standard character set that is known to the conforming reader. This character identification can occur if either the font uses a standard named encoding or the characters in the font are identified by standard character names or CIDs in a well-known collection. 9.10.2, "Mapping Character Codes to Unicode Values", describes in detail the overall algorithm for mapping character codes to Unicode values.
If a font is not defined in one of these ways, the glyphs can still be shown, but the characters cannot be converted to Unicode values without additional information:
• This information can be provided as an optional ToUnicode entry in the font dictionary (PDF 1.2; see 9.10.3, "ToUnicode CMaps"), whose value shall be a stream object containing a special kind of CMap file that maps character codes to Unicode values.
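pdf-reader does seem to conform to the above. There is a custom subset encoding in which /summationdisplay is mapped to character code 1 (hence \u0001). That is enough information to render the glyph, but not to reverse-map the font back to Unicode.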
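If the extracted text is still needed, one possible workaround is to supply the missing information yourself. The sketch below is an assumption on my part, not a pdf-reader feature: EXTRA_GLYPH_NAMES is a hand-written alias table that treats both CMEX10 glyphs as size variants of the summation sign (U+2211), and code_to_unicode and decode are hypothetical helpers operating on plain Ruby data.

```ruby
# Assumption: both CMEX10 glyphs are size variants of the summation sign.
EXTRA_GLYPH_NAMES = {
  summationdisplay: 0x2211,
  summationtext: 0x2211
}.freeze

# Build a character code => Unicode code point table from a /Differences
# array, consulting only the hand-written alias table above.
def code_to_unicode(differences, extra = EXTRA_GLYPH_NAMES)
  table = {}
  code = 0
  differences.each do |entry|
    if entry.is_a?(Integer)
      code = entry
    else
      table[code] = extra[entry] if extra.key?(entry)
      code += 1
    end
  end
  table
end

# Decode raw single-byte character codes, leaving unmapped codes unchanged.
def decode(bytes, table)
  bytes.each_byte.map { |byte| table.fetch(byte, byte) }.pack("U*")
end

table = code_to_unicode([1, :summationdisplay, :summationtext])
puts decode("\x01\x02", table) # prints ∑∑
```

A table like this has to be maintained by hand for every non-standard name you meet, so it only patches over the missing ToUnicode entry rather than replacing it.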