Extract text from a PDF file generated by Chrome's print option using PyPDF2

Question

I am trying to extract text from PDF files using the Python (v3.8.2) module PyPDF2 (v1.26.0). Everything works fine except for certain PDF files: those generated by Chrome's print option.

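I generated (or downloaded) these files through Chrome's print dialog, which has an option to save a page or document as a PDF. I cannot extract text from these PDF files: the code returns ' ' (empty), while other PDF files work without problems. If you want to test this yourself, save any web page as a PDF using Chrome's print option and run the code on that file. Chrome version: 81.0.4044.138.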

I found that Chrome uses Skia to save pages as PDF, but that did not help me solve the problem. (PDF producer: Skia/PDF m80)

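I found a similar question on Stack Overflow, but nobody has answered it yet. As a new user I cannot comment or add anything there, hence this new question. The similar question is: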

Extract text from pdf converted from webpage using Pypdf2

The code is as follows:

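import PyPDF2

# Open the PDF in binary mode; the with-block closes the file automatically.
with open('example.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # Extract the text of the first page (index 0).
    pageObj = pdfReader.getPage(0)
    print(pageObj.extractText())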

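I am a new user and this is my first post, so please point out anything inappropriate (I don't know whether there is any). I assure you that I searched Google first, but I either found no solution or lacked the knowledge to understand the problem and its solutions. Thank you.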

Answer 1

PyPDF2 is very unreliable at extracting text from PDFs, as pointed out here, which says:

While PyPDF2 has .extractText(), which can be used on its page objects (not shown in this example), it does not work very well. Some PDFs will return text and some will return an empty string. When you want to extract text from a PDF, you should check out the PDFMiner project instead. PDFMiner is much more robust and was specifically designed for extracting text from PDFs.

See my answer to a similar question here.
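As a rough illustration of the suggestion above, here is a minimal sketch that tries PyPDF2 first and falls back to PDFMiner when the result is empty. It assumes the pdfminer.six package is installed (pip install pdfminer.six) and uses its extract_text helper; the function names extract_with_pypdf2, extract_with_pdfminer, and first_nonempty_text are hypothetical names made up for this example. Whether PDFMiner actually recovers text from a given Skia-produced PDF is not confirmed here, so treat this as something to try rather than a guaranteed fix.

```python
from typing import Callable, Iterable, Optional


def extract_with_pypdf2(path: str) -> str:
    """Extract text from every page with the PyPDF2 1.x API used in the question."""
    import PyPDF2  # imported lazily so the sketch loads without the package

    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        pages = (reader.getPage(i).extractText() for i in range(reader.getNumPages()))
        return "".join(pages)


def extract_with_pdfminer(path: str) -> str:
    """Extract text with pdfminer.six (assumption: the package is installed)."""
    from pdfminer.high_level import extract_text  # lazy import, as above

    return extract_text(path)


def first_nonempty_text(
    path: str, extractors: Iterable[Callable[[str], str]]
) -> Optional[str]:
    """Return the first result that is not blank, or None if all are blank."""
    for extract in extractors:
        text = extract(path)
        if text and text.strip():
            return text
    return None


# Example use (file name taken from the question):
# text = first_nonempty_text("example.pdf", [extract_with_pypdf2, extract_with_pdfminer])
# print(text if text is not None else "No extractable text found")
```

Passing the extractors as a list keeps the fallback order explicit, so another library can be added or the order swapped without touching the loop.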

Tags: python, pdf, extraction, skia, pypdf2