Asking for the Unicode of letter conjunctions

Question

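When parsing PDF documents I occasionally run into some special characters. They are actually two English letters, such as 'fi', 'tt' or 'ti', but visually they look joined together, and each pair exists as a single character in the PDF string.

I checked the 'ToUnicode' entries for these characters, but only found that the 'ToUnicode' CMap table is broken, so I cannot find their Unicode values there.

For example, <012E> Tj will print fi like the attached picture. But its corresponding Font's ToUnicode CMap contains the entry <012E> <0001>, which is meaningless.

Can someone tell me their Unicode code points? Can they be found from the corresponding font program?

Any suggestions are appreciated.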

(Attached images, not reproduced here: the fi, tt, and ti glyphs.)

Answer 1

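First of all, what you call letter conjunctions are usually called ligatures, so I will use that term from here on.

Unicode discourages the use of dedicated code points for ligatures: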

The existing ligatures exist basically for compatibility and round-tripping with non-Unicode character sets. Their use is discouraged. No more will be encoded in any circumstances.

Ligaturing is a behavior encoded in fonts: if a modern font is asked to display “h” followed by “r”, and the font has an “hr” ligature in it, it can display the ligature. Some fonts have no ligatures, while others (especially fonts for non-Latin scripts) have hundreds of ligatures. It does not make sense to assign Unicode code points to all these font-specific possibilities.

(Unicode FAQ on ligatures)

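Thus, you should not use the existing ligature code points.

You seem to be trying to find the correct ToUnicode mapping for the ligature glyphs. For this, keep in mind that the value of a ToUnicode mapping need not be a single code point but may be a sequence of several: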

n beginbfchar
srcCode dstString
endbfchar
where dstString may be a string of up to 512 bytes.

(ISO 32000-1, section 9.10.3 ToUnicode CMaps)

Thus, concerning your example:
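To illustrate that a dstString can carry more than one code point, here is a minimal Python sketch that reads bfchar entries from a ToUnicode CMap and decodes each destination as UTF-16BE. The function name `parse_bfchar`, the sample CMap text, and the code <012F> for tt are assumptions made up for this illustration; the sketch ignores bfrange sections and is not a full CMap parser.

```python
import re

# One "<srcCode> <dstString>" pair, both written as hex strings.
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text: str) -> dict[int, str]:
    """Collect bfchar mappings as {character code: Unicode string}."""
    mapping = {}
    # Look only inside beginbfchar ... endbfchar sections.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in BFCHAR_ENTRY.findall(block):
            # The destination is UTF-16BE, so it may hold several code points.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


# Hypothetical CMap fragment; <012F> for "tt" is invented for the example.
sample = """
2 beginbfchar
<012E> <00660069>
<012F> <00740074>
endbfchar
"""

for code, text in parse_bfchar(sample).items():
    print(f"<{code:04X}> -> {text!r}")
```

With a mapping like this, a text extractor gets the two letters back directly, without needing any ligature code point.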

For example, <012E> Tj will print fi like attached picture. But in its corresponding Font's ToUnicode CMap: <012E> <0001>, which is meaningless.

simply use

<012E> <00660069>
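The destination value is just the UTF-16BE encoding of the target text written as hex. A small Python sketch of how such an entry could be produced follows; the helper name `bfchar_entry` is hypothetical, and it assumes 2-byte character codes as in the <012E> example.

```python
def bfchar_entry(src_code: int, text: str) -> str:
    """Format one bfchar line mapping a 2-byte character code to a Unicode string."""
    # Encode the replacement text as UTF-16BE and write it as uppercase hex.
    dst = text.encode("utf-16-be").hex().upper()
    return f"<{src_code:04X}> <{dst}>"


print(bfchar_entry(0x012E, "fi"))  # prints "<012E> <00660069>"
```

The same helper would work for the tt and ti glyphs once you know the character codes used for them in your font.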

If you still want to use the ligature code points, consult the Wikipedia article on Orthographic Ligatures, which lists a number of them. In particular, <FB01> corresponds to ﬁ, so for your example:

<012E> <FB01>

But remember, their use is discouraged.
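If extracted text does contain such a ligature code point, one possible way to get the plain letters back is Unicode compatibility normalization. The following Python sketch uses the standard `unicodedata` module; the wrapper name `expand_ligatures` is an assumption for this example. Note that NFKC also rewrites other compatibility characters (for example superscripts and fullwidth forms), so it may change more than the ligatures.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Replace compatibility characters such as U+FB01 by their plain equivalents."""
    # NFKC applies compatibility decomposition followed by canonical composition.
    return unicodedata.normalize("NFKC", text)


print(expand_ligatures("\ufb01nal"))  # prints "final"
```

This only helps for the few ligatures that have a code point at all, such as ﬁ. As far as I know there are no such code points for tt or ti, so for those glyphs the multi-character mapping shown earlier is the only option.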

pdf

unicode