Extracting text from scanned PDF without saving the scan as a new file image

Question

I want to extract text from scanned PDFs.
My "test" code is as follows:

from pdf2image import convert_from_path
from pytesseract import image_to_string
from PIL import Image

converted_scan = convert_from_path('test.pdf', 500)

for i in converted_scan:
    i.save('scan_image.png', 'png')
    
text = image_to_string(Image.open('scan_image.png'))
with open('scan_text_output.txt', 'w') as outfile:
    outfile.write(text.replace('\n\n', '\n'))

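Is there a way to extract the image content directly from the converted_scan object, without saving each scan to disk as a new "physical" image file?

Basically, I would like to skip this part: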

for i in converted_scan:
    i.save('scan_image.png', 'png')

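I have several thousand scans to extract text from. The generated image files are not particularly large, but their size is not negligible either, and writing them all to disk feels like overkill.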

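Edit

Here is a slightly different, more compact approach than Colonder's answer, based on this post. For .pdf files with many pages, it may be worth adding a progress bar to each loop, for example with the tqdm module.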

from wand.image import Image as w_img
from PIL import Image as p_img
import pyocr.builders
import regex, pyocr, io

infile = 'my_file.pdf'
tools = pyocr.get_available_tools()
tool = tools[0]
req_image = []
txt = ''

# to convert pdf to img and extract text
with w_img(filename = infile, resolution = 200) as scan:
    image_png = scan.convert('png')
    for i in image_png.sequence:
        img_page = w_img(image = i)
        req_image.append(img_page.make_blob('png'))
    for i in req_image:
        content = tool.image_to_string(
            p_img.open(io.BytesIO(i)),
            lang = tool.get_available_languages()[0],
            builder = pyocr.builders.TextBuilder()
        )
        txt += content

# to save the output as a .txt file
with open(infile[:-4] + '.txt', 'w') as outfile:
    full_txt = regex.sub(r'\n+', '\n', txt)
    outfile.write(full_txt)
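
The progress-bar idea mentioned above might look roughly like the sketch below. The helper name `ocr_with_progress` and its parameters are hypothetical, not part of pyocr or wand; the only real API assumed is `tqdm(iterable, desc=..., unit=...)`. If tqdm is not installed, the sketch falls back to plain iteration.

```python
from typing import Callable, Iterable, List

try:
    from tqdm import tqdm  # third-party progress bar
except ImportError:
    # Fallback so the sketch still runs without tqdm: no bar, same iteration.
    def tqdm(iterable, **kwargs):
        return iterable


def ocr_with_progress(page_blobs: Iterable[bytes],
                      ocr_blob: Callable[[bytes], str]) -> str:
    """Run ocr_blob over every page blob, showing progress, and join the text."""
    parts: List[str] = []
    for blob in tqdm(page_blobs, desc="OCR", unit="page"):
        parts.append(ocr_blob(blob))
    return "".join(parts)
```

In the script above, `req_image` would be passed as `page_blobs`, and `ocr_blob` would be a small function wrapping the `tool.image_to_string(p_img.open(io.BytesIO(blob)), ...)` call.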

Answer 1

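May 2021 update
I realized that, although pdf2image simply calls a subprocess, there is no need to save the images in order to OCR them afterwards. What you can do is simply the following (you can use pytesseract as the OCR library as well):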

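from pdf2image import convert_from_path
import pyocr
import pyocr.builders

# pick the first available OCR tool and language, as in the full script further down
tool = pyocr.get_available_tools()[0]
lang = tool.get_available_languages()[0]

for img in convert_from_path("some_pdf.pdf", 300):
    txt = tool.image_to_string(img,
                               lang=lang,
                               builder=pyocr.builders.TextBuilder())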
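
Since the question already uses pytesseract, the same idea could be sketched with it instead of pyocr. This is a minimal sketch, not a definitive implementation: `ocr_pages` and `pdf_to_text` are hypothetical helper names, and it relies on `pytesseract.image_to_string` accepting a PIL image directly, so nothing has to be written to disk.

```python
from typing import Callable, Iterable, List, Optional


def ocr_pages(pages: Iterable, ocr: Optional[Callable] = None) -> List[str]:
    """OCR each in-memory page image and return one string per page."""
    if ocr is None:
        # Imported lazily so the helper can be exercised without Tesseract installed.
        from pytesseract import image_to_string as ocr
    return [ocr(page) for page in pages]


def pdf_to_text(pdf_path: str, dpi: int = 300) -> str:
    """Convert a scanned PDF to text without writing page images to disk."""
    from pdf2image import convert_from_path  # requires poppler on the system

    pages = convert_from_path(pdf_path, dpi)  # list of PIL images, held in memory
    return "\n".join(ocr_pages(pages))
```

The `ocr` parameter is only there so a different OCR function can be swapped in (or a stub used for testing); calling `pdf_to_text('test.pdf')` uses pytesseract by default.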

Edit: you can also try the pdftotext library
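
A rough sketch of how pdftotext might fit in, assuming its `pdftotext.PDF(file_object)` interface, which yields one string per page. Note that pdftotext reads the text layer embedded in the PDF rather than performing OCR, so a purely scanned file will usually come back empty. The helper names `read_text_layer` and `needs_ocr` are hypothetical.

```python
from typing import List


def read_text_layer(pdf_path: str) -> List[str]:
    """Return the embedded text of each page (no OCR involved)."""
    import pdftotext  # third-party; imported lazily, needs poppler

    with open(pdf_path, "rb") as f:
        return list(pdftotext.PDF(f))


def needs_ocr(page_texts: List[str]) -> bool:
    """True when no page carries any non-whitespace text."""
    return not any(text.strip() for text in page_texts)
```

One possible use is to call `read_text_layer` first and fall back to the OCR route only when `needs_ocr` returns True, which would skip OCR for PDFs that already contain text.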

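pdf2image is a simple wrapper around pdftoppm and pdftocairo. Internally it does nothing more than call a subprocess. This script should do what you want, but you need the wand library as well as pyocr (I think this is a matter of preference, so feel free to use any library you like for the text extraction).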

from io import BytesIO
import sys
from typing import Iterator

from PIL import Image as Pimage
from wand.image import Image as Wimage

import pyocr
import pyocr.builders

def _convert_pdf2jpg(in_file_path: str, resolution: int = 300) -> Iterator[Pimage.Image]:
    """
    Convert a PDF file to JPEG images, one page at a time

    :param in_file_path: path of pdf file to convert
    :param resolution: resolution with which to read the PDF file
    :return: generator yielding one PIL Image per page
    """
    with Wimage(filename=in_file_path, resolution=resolution).convert("jpg") as all_pages:
        for page in all_pages.sequence:
            with Wimage(page) as single_page_image:
                # transform wand image to bytes in order to transform it into PIL image
                yield Pimage.open(BytesIO(bytearray(single_page_image.make_blob(format="jpeg"))))

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))
# Ex: Will use tool 'libtesseract'

langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))
# Ex: Will use lang 'fra'
# Note that languages are NOT sorted in any way. Please refer
# to the system locale settings for the default language
# to use.
for img in _convert_pdf2jpg("some_pdf.pdf"):
    txt = tool.image_to_string(img,
                               lang=lang,
                               builder=pyocr.builders.TextBuilder())
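
For the several thousand scans mentioned in the question, the per-file logic could be wrapped in a small batch loop. The sketch below is only an illustration under stated assumptions: `ocr_directory` is a hypothetical helper, and `extract_text` stands for whatever function turns one PDF path into text (for example, a function that joins the `txt` results of the loop above).

```python
from pathlib import Path
from typing import Callable, List


def ocr_directory(src_dir: str, extract_text: Callable[[str], str]) -> List[Path]:
    """Write one .txt file next to every PDF in src_dir and return the paths written."""
    written = []
    for pdf_path in sorted(Path(src_dir).glob("*.pdf")):
        text = extract_text(str(pdf_path))
        out_path = pdf_path.with_suffix(".txt")  # scan.pdf -> scan.txt
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written
```

Only the text output touches the disk here; the page images stay in memory inside `extract_text`.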


python

ocr