Reading images from a PDF and extracting text from them

Question

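Problem statement: I have a PDF with n pages, each containing one image, and I need to read the text from each image and perform some operations on it.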

What I tried: I have to do this in Python, and the only library I found that gives good results is pytesseract. I am pasting the sample code I tried:

    fn = kw['fn'] = self.env.context.get('wfg_pg', kw['fn'])
    zoom, zoom_config = self.get_zoom_for_doc(index), ' -c tessedit_do_invert=0'
    if 3.3 < zoom < 3.5:
        zoom_config += ' --oem 3 --psm 4'
    elif 0 != page_number_list[0]:
        zoom_config += ' --psm 6'
    full_text, page_length = '', kw['doc'].pageCount
    if recursion and index >= 10:
        return fn.get('most_correct') or fn.get(page_number_list[0])
    mat = fitz.Matrix(zoom, zoom)  # increase resolution
    for page_no in page_number_list:
        page = kw['doc'].loadPage(page_no)  # number of page
        pix = page.getPixmap(matrix=mat)
        with Image.open(io.BytesIO(pix.getImageData())) as img:
            text_of_each_page = str(pytesseract.image_to_string(img, config='%s' % zoom_config)).strip()
        fn[page_no] = text_of_each_page
        full_text = '\n'.join((full_text, text_of_each_page, '\n'))
    _logger.critical(f"full text in load image {full_text}")
    args = (full_text, page_number_list)
    load = recursion and self.run_recursion_to_load_new_image_to_text(*args, **kw)
    if recursion and load:
        return self.load_image
    return full_text

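Problem: My PDF contains dates such as 1/13 and 1/7, and the library reads them as 143 and 1n; in some places it reads 17 as 1). After the text, it also emits random symbols such as { & . , = that do not appear anywhere in the PDF.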

What I tried to improve accuracy:

1. I tried converting the image to .tiff format but it didn't work for me.
2. Tried adjusting the resolution of the image.

Answer 1

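You can use the pdftoppm tool to convert the pages to images very quickly, since it lets you use multi-threading just by passing thread_count=(number of threads). You can refer to this link for more information about the tool. Better images can also improve Tesseract's accuracy.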
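
A rough sketch of this suggestion follows. The `thread_count` keyword matches `convert_from_path` in the pdf2image package, a Python wrapper around pdftoppm, so the example assumes that package is what is meant. The names `ocr_pdf_pages` and `join_pages`, the 300 DPI default, the thread count, and the `--oem 3 --psm 6` config are illustrative choices rather than values taken from the question, and the code assumes Poppler and Tesseract are installed.

```python
from typing import Iterable, List


def join_pages(page_texts: Iterable[str]) -> str:
    """Join per-page OCR results, separated by one blank line."""
    return "\n\n".join(text.strip() for text in page_texts)


def ocr_pdf_pages(pdf_path: str, dpi: int = 300, threads: int = 4,
                  config: str = "--oem 3 --psm 6") -> List[str]:
    """Render every page of a PDF to an image and OCR it (hypothetical helper)."""
    # Imported lazily so join_pages stays usable without Poppler/Tesseract installed.
    from pdf2image import convert_from_path
    import pytesseract

    # pdf2image calls pdftoppm; thread_count lets it convert pages in parallel.
    pages = convert_from_path(pdf_path, dpi=dpi, thread_count=threads)

    texts = []
    for page in pages:
        # Grayscale input is an assumption here; it may or may not help on these scans.
        gray = page.convert("L")
        texts.append(pytesseract.image_to_string(gray, config=config).strip())
    return texts
```

Calling `join_pages(ocr_pdf_pages("scan.pdf"))`, with a hypothetical file name, would return the text of all pages as one string. Whether this fixes the misread dates depends on the scans; the `dpi` value and the `--psm` mode are the settings to experiment with first.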

text-extraction

python-3.x

python-tesseract

image-text

python-pdfreader