PDFMiner missing periods

Question

I want to extract the text content of this PDF: https://www.welivesecurity.com/wp-content/uploads/2019/07/ESET_Okrum_and_Ketrican.pdf

This is my code:

import os
import re
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

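def get_pdf_text(path):
    """Extract all text from the PDF at `path` and collapse whitespace runs."""
    rsrcmgr = PDFResourceManager()
    with StringIO() as outfp, open(path, 'rb') as fp:
        # TextConverter writes the decoded text of each page into outfp.
        device = TextConverter(rsrcmgr, outfp)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.get_pages(fp, check_extractable=True):
            interpreter.process_page(page)
        device.close()
        # Raw string so that \s is a regex class, not an invalid string escape.
        text = re.sub(r'\s+', ' ', outfp.getvalue())
    return text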

if __name__ == '__main__':
    path = './ESET_Okrum_and_Ketrican.pdf'
    print(get_pdf_text(path))

But in the extracted text, some period characters are missing:

is a threat group believed to be operating out of China Its attacks were first reported in 2012, when the group used a remote access trojan (RAT) known as Mirage to attack high-profile targets around the world However, the group’s activities were traced back to at least 2010 in FireEye’s 2013 report on operation Ke3chang – a cyberespionage campaign directed at diplomatic organizations and missions in Europe The attackers resurfaced

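This really bothers me, because I am doing natural language processing on the extracted text, and without periods the whole document is treated as one big sentence.

I strongly suspect the cause is that the PDF's /ToUnicode map contains bad data, because I ran into the same problem with PDF.js. I have read that whenever a PDF's /ToUnicode map is wrong, its text cannot be extracted correctly without OCR.

But I have also been using pdf2htmlEX and PDFium (Chrome's PDF renderer), and both extract all characters of the PDF just fine (at least for this PDF, that is).

For example, when I gave this PDF to pdf2htmlEX, it detected that the /ToUnicode data was bad and dumped a new font.

So my question is: can PDFMiner do what pdf2htmlEX and PDFium do, so that all characters of a PDF are extracted correctly even when the /ToUnicode data is bad?

Thanks for your help.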

Answer 1

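I don't think this can be fixed, because the tool is not doing anything wrong. On inspection, the PDF writes a real period, using this instruction: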

(.) Tj

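(.) stands for character code 0x2E, which is also the correct Unicode code point for the period (or "full stop").

However, the font in use has a ToUnicode map (yay!), but that map appears to map the period to the wrong character (oops!):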

<2E> <0020>

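So the period character is mapped to the 0x0020 character, which is, well, a space.

So your options are either to find a tool that can fix this entry in the ToUnicode map of this font (I don't know of one), or to fall back to something like OCR.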
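
To illustrate what a check for this kind of faulty entry could look like, here is a minimal sketch. It assumes the ToUnicode CMap has already been obtained as decoded text, that the font uses one-byte, ASCII-compatible codes, and that only `bfchar` entries are of interest. The function name `suspicious_mappings` and the `<41> <0041>` line in the sample are hypothetical; only `<2E> <0020>` comes from the PDF discussed here.

```python
import re

# Matches a one-byte bfchar entry such as "<2E> <0020>".
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]{2})>\s*<([0-9A-Fa-f]{4})>")


def suspicious_mappings(cmap_text):
    """Return (code, mapped_char) pairs where a visible ASCII code maps to whitespace."""
    found = []
    for code_hex, uni_hex in BFCHAR_ENTRY.findall(cmap_text):
        code = int(code_hex, 16)
        mapped = chr(int(uni_hex, 16))
        # Assumption: codes 0x21..0x7E are visible glyphs in this font,
        # so mapping one of them to whitespace is a red flag.
        if 0x21 <= code <= 0x7E and mapped.isspace():
            found.append((code, mapped))
    return found


# Sample CMap excerpt; the second entry is made up for contrast.
sample = """
2 beginbfchar
<2E> <0020>
<41> <0041>
endbfchar
"""
print(suspicious_mappings(sample))
```

A check like this only flags candidates. Whether an entry is really wrong still has to be judged against the font's Encoding.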

Answer 2

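In fact, the PDF is similar to the one examined in this answer:

According to the Encoding entry of the font at hand, it uses the regular WinAnsiEncoding for codes from 0x20 upward, so code 0x2E represents the period character.
As @David already pointed out in his answer, the ToUnicode map maps that code to U+0020, the regular space character.
In the page content streams, yet another mechanism for mapping drawn text to Unicode is used: marked content with an ActualText property, e.g. for the extracted text quoted by the OP: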
```
(, also known as APT15, is a threat group believed to be operating out of\
 China)Tj
/Span<</ActualText<FEFF002E>>> BDC 
(.)Tj
EMC  
```
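That is, the code 0x2E in (.)Tj (= "." in ASCII), which according to the Encoding represents a period and which the ToUnicode map then "corrects" to a space, is marked as actually representing 0xFEFF002E in UTF-16 Unicode, i.e. a BOM followed by a period character.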
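
The decoding of that ActualText value can be checked with a few lines of Python. This is only a sketch: the helper names are hypothetical, the regular expression handles just the hex-string form shown above (with an even number of hex digits), and strings without a BOM are decoded as Latin-1 as a rough stand-in for PDFDocEncoding.

```python
import re

# Finds hex-string ActualText values such as /ActualText<FEFF002E>.
ACTUAL_TEXT = re.compile(r"/ActualText\s*<([0-9A-Fa-f]+)>")


def decode_pdf_hex_text(hex_string):
    """Decode a PDF hex text string; a leading FEFF marks UTF-16BE."""
    raw = bytes.fromhex(hex_string)
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    # Assumption: Latin-1 approximates PDFDocEncoding well enough for a sketch.
    return raw.decode("latin-1")


def actual_texts(content):
    """Return the decoded ActualText values found in a content stream excerpt."""
    return [decode_pdf_hex_text(h) for h in ACTUAL_TEXT.findall(content)]


fragment = "/Span<</ActualText<FEFF002E>>> BDC\n(.)Tj\nEMC"
print(actual_texts(fragment))
```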

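Thus,

a text extractor that only looks at the font's Encoding sees 0x2E as a period (most likely this is what pdf2htmlEX does, explicitly ignoring the ToUnicode map here);
a text extractor that also looks at the ToUnicode map but not at the ActualText marked-content property sees 0x2E as a space (like pdfminer);
a text extractor that also looks at the ActualText marked-content property sees 0x2E as a period (e.g. Adobe Reader copy and paste).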
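
The three behaviours can be modelled in a few lines. The sketch below is a toy model, not the code of any of the tools mentioned: the two dictionaries hold only the single entry discussed here, and the function `extract` and its flags are hypothetical.

```python
# One-entry excerpts of the two tables discussed above.
WIN_ANSI = {0x2E: "."}    # the font's Encoding: 0x2E is a period
TO_UNICODE = {0x2E: " "}  # the faulty ToUnicode entry: 0x2E is a space


def extract(code, actual_text=None, use_to_unicode=True, use_actual_text=True):
    """Resolve one character code the way the three kinds of extractor would."""
    # Highest priority: the ActualText property of the enclosing marked content.
    if use_actual_text and actual_text is not None:
        return actual_text
    # Next: the font's ToUnicode map, if the extractor consults it.
    if use_to_unicode and code in TO_UNICODE:
        return TO_UNICODE[code]
    # Fallback: the font's Encoding.
    return WIN_ANSI[code]


# Encoding only (what pdf2htmlEX most likely does here).
print(repr(extract(0x2E, ".", use_to_unicode=False, use_actual_text=False)))
# Encoding plus ToUnicode, no ActualText (like pdfminer).
print(repr(extract(0x2E, ".", use_actual_text=False)))
# All three mechanisms (like Adobe Reader copy and paste).
print(repr(extract(0x2E, ".")))
```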

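Misleading some text extractors in such an obvious way is usually done on purpose: automated text extraction (most such extractors use ToUnicode but not ActualText) yields incorrect text, while copying and pasting from Adobe Reader still works.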

Tags: python, pdf, unicode, pdfminer