将 PDF 转换为图像的省时方法

Question

上下文：

我有正在处理的 PDF 文件。我正在使用 ocr 从这些文档中提取文本，为了能够做到这一点，我必须将我的 pdf 文件转换为图像。我目前使用 pdf2image 模块的 convert_from_path 函数，但它的时间效率非常低（9 页 pdf 需要 9 分钟）。

问题：

我正在寻找加速此过程的方法或其他将我的 PDF 文件转换为图像的方法。

附加信息：

我知道函数中有一个 thread_count 参数，但在多次尝试后似乎没有任何区别。

这是我正在使用的全部功能：

def pdftoimg(fic,output_folder):
# Store all the pages of the PDF in a variable 
pages = convert_from_path(fic, dpi=500,output_folder=output_folder,thread_count=9, poppler_path=r'C:\Users\Vincent\Documents\PDF\poppler-21.02.0\Library\bin') 

image_counter = 0

# Iterate through all the pages stored above 
for page in pages: 
    filename = "page_"+str(image_counter)+".jpg"
    page.save(output_folder+filename, 'JPEG') 
    image_counter = image_counter + 1
    
for i in os.listdir(output_folder):
    if i.endswith('.ppm'):
        os.remove(output_folder+i)

Link 到 convert_from_path 参考。

Answer 1

我使用另一个名为 fitz 的模块找到了该问题的答案，该模块是 python 绑定到 MuPDF。

首先安装PyMuPDF：

可以找到文档 here 但对于 windows 用户来说它相当简单：

pip install PyMuPDF

然后导入fitz模块：

import fitz
print(fitz.__doc__)

>>>PyMuPDF 1.18.13: Python bindings for the MuPDF 1.18.0 library.
>>>Version date: 2021-05-05 06:32:22.
>>>Built for Python 3.7 on win32 (64-bit).

打开文件并将每一页保存为图片：

get_pixmap() 方法接受不同的参数，允许您控制图像（变化、分辨率、颜色...）所以我建议您红色文档 here。

def convert_pdf_to_image(fic):
    #open your file
    doc = fitz.open(fic)
    #iterate through the pages of the document and create a RGB image of the page
    for page in doc:
        pix = page.get_pixmap()
        pix.save("page-%i.png" % page.number)

希望这对其他人有帮助。

将 PDF 转换为图像的省时方法

Time efficient way to convert PDF to image

python

pdf

ocr

image

pdf2image