According to pdf_reference_1-7, the ToUnicode CMap has higher priority than the Encoding, but I have a file where the opposite holds. What should I do?

Question

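Content stream: (\037)Tj

According to the encoding Differences, CID 31 should map to '✓', which is correct.

Base encoding: WinAnsiEncoding

Differences: [31, uni2713]

According to the ToUnicode CMap, CID 31 maps to "3", which is wrong.

CMap: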

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<1F> <0033>
<0020> <0020>
endbfchar
endcmap CMapName currentdict /CMap defineresource pop end end

Answer 1

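In PDF, rendering and text extraction are two independent paths. Unlike in, say, HTML, they are separate operations that rely on different data in the file.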

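From what you have provided, the page content stream contains the character code \037 (octal, which is 31 decimal). Rendering uses the font's Encoding, of which Differences is a part, so the glyph drawn is the one the font associates with the name uni2713.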
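As a rough illustration of the rendering path, the sketch below expands the Differences array from the question and resolves the character code to a glyph name. The helper names `differences_to_map` and `glyph_name_to_unicode` are hypothetical, and only the `uniXXXX` glyph-naming convention is handled. A real renderer looks the glyph name up in the font program rather than converting it to Unicode; the conversion here only shows which character the glyph is meant to represent.

```python
import re

# Simplified model of the /Differences array from the question (an assumption:
# a real PDF would write it as [31 /uni2713]).
DIFFERENCES = [31, "uni2713"]


def differences_to_map(diff):
    """Expand a /Differences-style list into {character code: glyph name}."""
    mapping = {}
    code = None
    for item in diff:
        if isinstance(item, int):
            code = item  # an integer resets the running character code
        else:
            if code is None:
                raise ValueError("glyph name appears before any character code")
            mapping[code] = item
            code += 1  # subsequent names take consecutive codes
    return mapping


def glyph_name_to_unicode(name):
    """Decode a 'uniXXXX' glyph name (one BMP code point); None otherwise."""
    match = re.fullmatch(r"uni([0-9A-F]{4})", name)
    return chr(int(match.group(1), 16)) if match else None


code = int("037", 8)  # the octal escape \037 in the content stream
glyph_name = differences_to_map(DIFFERENCES)[code]
print(code, glyph_name, glyph_name_to_unicode(glyph_name))
```

Under these assumptions the code resolves to the glyph named uni2713, which matches the check mark the question reports seeing on the page.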

Text extraction, however, uses the ToUnicode CMap. You can verify this by opening the PDF in several PDF viewers and copying and pasting the text into a text editor.
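The extraction path can be sketched in the same spirit. The snippet below reads only the bfchar section of the CMap from the question and looks up character code 0x1F. The function `parse_bfchar` is a hypothetical helper, not part of pdfminer or any other library, and it deliberately ignores codespace ranges, bfrange sections, and the byte length of the source codes.

```python
import re

# The bfchar section copied from the CMap in the question.
CMAP_TEXT = """
2 beginbfchar
<1F> <0033>
<0020> <0020>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} from beginbfchar/endbfchar blocks."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section)
        for src_hex, dst_hex in pairs:
            # Destination values are UTF-16BE encoded.
            mapping[int(src_hex, 16)] = bytes.fromhex(dst_hex).decode("utf-16-be")
    return mapping


to_unicode = parse_bfchar(CMAP_TEXT)
print(repr(to_unicode[0x1F]))
```

With this CMap, code 0x1F extracts as "3", so a tool that trusts ToUnicode reports "3" even though the page shows a check mark.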

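The reason is that a character code can map to only one glyph in a given font, but that same character code may map to more than one Unicode value. Take a ligature such as U+FB01 (ﬁ).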
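To make the ligature case concrete, here is a small sketch. The bfchar destination `00660069` is a hypothetical value (it does not come from the file in the question) showing how one character code can carry two Unicode characters, and the NFKC normalization shows that U+FB01 itself decomposes to the same two letters.

```python
import unicodedata

# Hypothetical ToUnicode destination for a single ligature glyph:
# two UTF-16BE code units, U+0066 and U+0069.
dst_hex = "00660069"
extracted = bytes.fromhex(dst_hex).decode("utf-16-be")

# The single-code-point ligature and its compatibility decomposition.
ligature = "\ufb01"
decomposed = unicodedata.normalize("NFKC", ligature)

print(len(ligature), extracted, decomposed)
```

Either representation is a plausible extraction result for the same glyph, which is why the glyph mapping and the Unicode mapping are stored separately.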

"... find that there is no program good enough to extract text and tables from PDF."

You may not have seen the text and table extraction tools developed by the company I work for: https://www.pdftron.com/document-understanding

https://www.pdftron.com/pdf-tools/pdf-table-extraction

pdf

pdfminer