How to extract all text from a PDF?
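I am using the PyPDF2 library to extract text from a PDF, but I am running into a problem when I loop over the pages.
With the code below I can extract the text of the first page as a string.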
from PyPDF2 import PdfFileReader
reader = PdfFileReader("mypdf.pdf")
# Print number of pages
num_page = reader.getNumPages()
print(num_page)
# Get the first page (index 0) and print its text
page = reader.pages[0]
print(page.extractText())
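I want to take the page count returned by .getNumPages() and use it to step through reader.pages[i], once for every page instead of only reader.pages[0].
This is the code I am trying, to print all 99 pages: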
from PyPDF2 import PdfFileReader
reader = PdfFileReader("mypdf.pdf")
# Print number of pages
num_page = reader.getNumPages()
print(num_page)
# Print the number of pages where [0] is the first page
page = reader.pages[0]
i = 0
print(type(num_page))
print(type(i))
for i in page:
    if i < num_page:
        page = reader.pages[i]
        print(page.extractText())
        i = i + 1
    else:
        print("done")
This error occurs:
Traceback (most recent call last):
  File "/home/wilian/PycharmProjects/ExtractText/pypdf.py", line 13, in <module>
    if i < num_page:
TypeError: '<' not supported between instances of 'NameObject' and 'int'
99
<class 'int'>
<class 'int'>
Process finished with exit code 1
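Try a simple range loop. In your code, `for i in page` iterates over the page object itself. A PyPDF2 page behaves like a dictionary, so `i` is rebound to one of its keys (a NameObject), and that cannot be compared with an int.
Example: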
from PyPDF2 import PdfFileReader

def pdf_info():
    with open("my_pdf.pdf", "rb") as f:
        reader = PdfFileReader(f)
        for i in range(reader.getNumPages()):
            print(i)
            # page = reader.pages[i]
            # print(page.extractText())

if __name__ == '__main__':
    pdf_info()
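
Since the goal is all of the text rather than the page numbers, the loop can also run over reader.pages directly. The following is only a sketch under a few assumptions: `extract_all_text` is a hypothetical helper name, not part of PyPDF2, and it tries `extract_text()` first because newer PyPDF2 releases use that name in place of `extractText()`.

```python
def extract_all_text(reader):
    """Return the text of every page in `reader`, joined by newlines.

    `reader` is assumed to expose a `.pages` sequence whose items offer
    `extract_text()` (newer PyPDF2) or `extractText()` (older PyPDF2).
    """
    texts = []
    for page in reader.pages:  # iterate the pages themselves, no index needed
        # Prefer the newer method name and fall back to the older one.
        extract = getattr(page, "extract_text", None) or getattr(page, "extractText")
        # Treat a page with no extractable text as an empty string.
        texts.append(extract() or "")
    return "\n".join(texts)


# Hypothetical usage with the file name from the question:
# from PyPDF2 import PdfFileReader
# with open("mypdf.pdf", "rb") as f:
#     print(extract_all_text(PdfFileReader(f)))
```

The helper only reads `.pages`, so it should work the same way if your PyPDF2 version calls the reader class `PdfReader` instead of `PdfFileReader`.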