Python: read a PDF straight across, as it looks in the PDF

Question

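If I use the code from the answer here: Extracting text from a PDF file using PDFMiner in python?

I can extract the text when I apply it to this PDF: https://www.tencent.com/en-us/articles/15000691526464720.pdf

However, under "CONSOLIDATED INCOME STATEMENT" you can see that it reads down the page, i.e. Revenues VAS Online advertising, and only later reads the numbers. I would like it to read across, i.e.:

Revenues 73,528 49,552 73,528 66,392 VAS 46,877 35,108 and so on. Is there a way to do this?

I am also looking for possible solutions other than pdfminer.

If I try this code with PyPDF2, not all of the text even shows up: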

# importing required modules
import PyPDF2

# creating a pdf file object
pdfFileObj = open(file, 'rb')

# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# printing number of pages in pdf file
a=(pdfReader.numPages)

# creating a page object
for i in range(0,a):
    pageObj = pdfReader.getPage(i)
    print(pageObj.extractText())

Answer 1

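Your problem has more to do with how the PDF file is constructed than with PyPDF2. I ran into many of the same issues when parsing PDFs to reconstruct the page layout.

When a PDF is generated, each block of text is positioned on the page and rendered according to the applied font rules (similar to building an HTML document using only absolute positioning and CSS). A simple PDF library will just return the text of each block in the order the blocks are defined in the file (I have had documents whose pages were generated in reverse, with the last paragraph defined first).

Either you need a more advanced PDF library (possibly one built on top of a simple library) that uses the X, Y position of each text block and its font information to work out the vertical positioning, or you develop this yourself. It looks like the software JosephA mentioned does exactly this.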
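To illustrate the "develop this yourself" route, here is a minimal sketch of rebuilding rows from positioned text blocks. The TextBlock record, the rows_from_blocks helper, the y_tolerance value and the coordinates are all made up for illustration; a real library exposes position data under its own names.

```python
from collections import namedtuple

# Hypothetical record for a positioned text block; real libraries expose
# similar data under their own names.
TextBlock = namedtuple("TextBlock", ["x", "y", "text"])


def rows_from_blocks(blocks, y_tolerance=2.0):
    """Group blocks whose y coordinates lie within y_tolerance into rows."""
    rows = []
    # PDF coordinates grow upward, so sort by descending y to go top-down.
    for block in sorted(blocks, key=lambda b: -b.y):
        if rows and abs(rows[-1][0].y - block.y) <= y_tolerance:
            rows[-1].append(block)
        else:
            rows.append([block])
    # Within each row, read left to right.
    return [" ".join(b.text for b in sorted(row, key=lambda b: b.x)) for row in rows]


# Made-up blocks, defined column by column the way a file might store them.
blocks = [
    TextBlock(50, 700, "Revenues"),
    TextBlock(50, 680, "VAS"),
    TextBlock(300, 700.5, "73,528"),
    TextBlock(300, 680.5, "46,877"),
    TextBlock(400, 700, "49,552"),
    TextBlock(400, 680, "35,108"),
]
lines = rows_from_blocks(blocks)
for line in lines:
    print(line)
```

The tolerance is there because blocks on the same visual row rarely share exactly the same y value; it would need tuning per document.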

Answer 2

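I first looked at the extractText function of PyPDF2 and tried to strip any newlines from the output, to give you the page "across" on a single line.

The output was less than ideal.

Moreover, the output does not appear to be reliable. From the PyPDF2 documentation: "Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated."

So I went on to explore the option of using Tesseract. This deviates a bit from using a "pdf extraction library"; it is basically "build your own extraction script".

Once you get the hang of Tesseract, it is not too hard. I spent about an hour researching it.

Here is my own code for extracting a pdf page by page: https://gist.github.com/Benehiko/60862a6be13b3b652b7d506121b95811

Note that my code has one drawback: it does not extract the pages in order.

In case the link dies: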

from PIL import Image
import pytesseract
import subprocess
import pathlib
import glob
import os

# Point Tesseract at the "tessdata" directory next to this script.
ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
os.environ["TESSDATA_PREFIX"] = ROOT_DIR + "/tessdata"

# Convert the pdf into one tiff per page with ImageMagick's "convert".
pathlib.Path("pages").mkdir(parents=False, exist_ok=True)
params = ['convert', '-density', '300', 'test.pdf', '-depth', '8',
          'pages/test_%02d.tiff']
subprocess.check_call(params)

# glob returns the files in arbitrary order, so pages may print out of order.
images = glob.glob("pages/*.tiff")
for image_path in images:
    image = Image.open(image_path)
    # OCR the page, then replace newlines so the text reads as one line.
    text = pytesseract.image_to_string(image, lang='eng', nice=0,
                                       output_type=pytesseract.Output.STRING).replace("\n", " ")
    print(text)

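Code explanation:

This first converts the pdf into separate "tiff" images, because for some reason reading a multi-page tiff with pytesseract only reads the first page. The tiff files are saved in a separate directory named "pages". Pytesseract reads each file and returns the text, which is then formatted with ".replace" to remove all line breaks and put the text on a single line.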
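If the page order matters, one possible fix is to sort the tiff paths by their page index before running OCR. This is only a sketch: the page_number helper and the example file list are hypothetical, and it assumes the "test_%02d.tiff" naming pattern used for the converted pages.

```python
import re


def page_number(path):
    """Return the last run of digits in the path as an int (or -1 if none)."""
    digits = re.findall(r"\d+", path)
    return int(digits[-1]) if digits else -1


# Hypothetical file list, in the arbitrary order glob.glob might return it.
images = ["pages/test_10.tiff", "pages/test_02.tiff", "pages/test_00.tiff"]
ordered = sorted(images, key=page_number)
print(ordered)
```

With zero-padded names a plain sorted(images) gives the same order for up to 100 pages; the numeric key keeps working once the index grows past two digits.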

Starting point: Tesseract install

Using tesseract in Python: pytesseract

Training data used: eng.traineddata

Additional source: pdf to tiff

Pytesseract: documentation

Hope this helps. Not sure if this is what you were looking for.

Answer 3

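You can use PDFMiner for this job; in my experience it works better than the other open-source Python tools.

The key is to specify the laparams argument properly instead of leaving it at its default values. This parameter gives PDFMiner more information about the page layout. Since the text here is separated by the whitespace of tables, we need to instruct PDFMiner to use a large character margin (char_margin).

The code for the layout analysis is here. Experiment with the hyperparameters to find what gives the best result for this particular document.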
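To get a feel for why a large char_margin helps with tables, here is a toy model of the idea. It is not PDFMiner's actual implementation: the merge_into_lines helper, the fixed character width and the coordinates are assumptions made purely for illustration.

```python
def merge_into_lines(chunks, char_margin):
    """Merge (x0, x1, text) chunks on one baseline when the gap is small.

    Toy rule: a chunk joins the current line if the horizontal gap to the
    previous chunk is at most char_margin times an assumed character width.
    """
    char_width = 5.0  # assumed average character width, in points
    lines = []
    prev_x1 = None
    for x0, x1, text in sorted(chunks):
        if prev_x1 is not None and x0 - prev_x1 <= char_margin * char_width:
            lines[-1].append(text)
        else:
            lines.append([text])
        prev_x1 = x1
    return [" ".join(parts) for parts in lines]


# Hypothetical positions for one table row: a label, then two far-away numbers.
row = [(50, 95, "Revenues"), (300, 330, "73,528"), (400, 430, "49,552")]
small = merge_into_lines(row, char_margin=2.0)
large = merge_into_lines(row, char_margin=1000.0)
print(small)
print(large)
```

With the small margin the label and the numbers end up as separate pieces of text, which is roughly the "reads down the page" symptom; with the large margin the whole row stays together.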

Here is sample code for the PDF in question. I use only one page here for demonstration:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path, pages):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'

    laparams=LAParams(all_texts=True, detect_vertical=True, 
                      line_overlap=0.5, char_margin=1000.0, #set char_margin to a large number
                      line_margin=0.5, word_margin=2,
                      boxes_flow=1)
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set(pages)

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

pdf_text_page6 = convert_pdf_to_txt("15000691526464720.pdf", pages=[6])

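The output for the given page (index 6, which corresponds to page 7 of the document) is shown below. It is not perfect, but all the numeric parts of the table are captured on the same line as the text.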

Page 7 of 11 

  Unaudited    Unaudited 

  1Q2018  1Q2017   1Q2018  4Q2017 

Revenues  73,528  49,552   73,528  66,392 

    VAS   46,877  35,108   46,877  39,947 

   Online advertising   10,689  6,888   10,689  12,361 

    Others  15,962  7,556   15,962  14,084 

Cost of revenues  (36,486)  (24,109)   (36,486)  (34,897) 

Gross profit  37,042  25,443   37,042  31,495
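Once each row comes out on a single line like this, it can be split into a label and its numbers. The following is a rough sketch: parse_row is a hypothetical helper, and it assumes that labels contain no digits and that parentheses mark negative amounts, as is common in accounting tables. Header lines such as "1Q2018  1Q2017" would need separate handling.

```python
import re

# A number with thousands separators, optionally wrapped in parentheses.
NUMBER = re.compile(r"\(?\d[\d,]*\)?")


def parse_row(line):
    """Split 'Label  1,234  (5,678)' into ('Label', [1234, -5678])."""
    match = NUMBER.search(line)
    if match is None:
        return line.strip(), []
    label = line[:match.start()].strip()
    values = []
    for token in NUMBER.findall(line[match.start():]):
        number = int(token.strip("()").replace(",", ""))
        # Assumption: parentheses mark a negative amount (accounting style).
        values.append(-number if token.startswith("(") else number)
    return label, values


print(parse_row("Revenues  73,528  49,552   73,528  66,392"))
print(parse_row("Cost of revenues  (36,486)  (24,109)   (36,486)  (34,897)"))
```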
