How to extract all text from a PDF?

I am using the PyPDF2 library to extract text from a PDF, but I am running into a problem when I try to loop over the pages.

With the code below, I can extract the text of the first page as a string.

from PyPDF2 import PdfFileReader
reader = PdfFileReader("mypdf.pdf")
# Print number of pages
num_page = reader.getNumPages()
print(num_page)
# Print the text of the first page; index 0 is the first page
page = reader.pages[0]
print(page.extractText())

I want to take the page count returned by .getNumPages() and step through reader.pages that many times, instead of only reading reader.pages[0].

This is the code I am trying to use to print all 99 pages:

from PyPDF2 import PdfFileReader
reader = PdfFileReader("mypdf.pdf")
# Print number of pages
num_page = reader.getNumPages()
print(num_page)
# Start from the first page; index 0 is the first page

page = reader.pages[0]
i = 0
print(type(num_page))
print(type(i))
for i in page:
    if i < num_page:
        page = reader.pages[i]
        print(page.extractText())
        i = i + 1
    else:
        print("done")

This error occurs:

Traceback (most recent call last):
  File "/home/wilian/PycharmProjects/ExtractText/pypdf.py", line 13, in <module>
    if i < num_page:
TypeError: '<' not supported between instances of 'NameObject' and 'int'
99
<class 'int'>
<class 'int'>

Process finished with exit code 1

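Try a simple range loop. In your code, `for i in page` iterates over the keys of the page's dictionary (NameObject instances), not over page numbers, so `i` is no longer an int when it is compared with num_page.

Example: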

from PyPDF2 import PdfFileReader


def pdf_info():
    with open("my_pdf.pdf", "rb") as f:
        reader = PdfFileReader(f)
        for i in range(reader.getNumPages()):
            print(i)
            # page = reader.pages[i]
            # print(page.extractText())


if __name__ == '__main__':
    pdf_info()
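
Building on the range loop above, here is a minimal sketch of collecting the text of every page rather than printing indices. It assumes the same older PyPDF2 API used in the question (PdfFileReader, extractText); as far as I know, newer PyPDF2 releases rename these to PdfReader and extract_text(), so adjust the names if your version complains. The helper names extract_all_text and extract_pdf_text are made up for this example.

```python
def extract_all_text(reader):
    """Return a list with the extracted text of each page in `reader`.

    `reader` is assumed to be a PyPDF2 PdfFileReader, or any object with a
    `pages` sequence whose items provide extractText().
    """
    texts = []
    # Iterate over the pages themselves; no index or page count is needed.
    for page in reader.pages:
        texts.append(page.extractText())
    return texts


def extract_pdf_text(path):
    """Hypothetical wrapper: open the PDF at `path` and return all its text."""
    # Imported here so extract_all_text() can be used without PyPDF2 installed.
    from PyPDF2 import PdfFileReader

    with open(path, "rb") as f:
        reader = PdfFileReader(f)
        # Extract while the file is still open, since the reader reads from it.
        return "\n".join(extract_all_text(reader))
```

Calling print(extract_pdf_text("mypdf.pdf")) should then print the text of all pages, one after another. Looping over reader.pages directly avoids the manual counter and the `i < num_page` check from the question.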