如何从pdf中提取文本框并将其转换为图像

How to extract text boxes from a pdf and convert them to image

我正在尝试从包含文本的 pdf 中获取裁剪框,这对于为我的一个模型收集训练数据非常有用,这就是我需要它的原因。这是一个pdf示例: https://github.com/tomasmarcos/tomrep/blob/tomasmarcos-example2delete/example%20-%20Git%20From%20Bottom%20Up.pdf;例如,我想将第一个框文本作为图像(jpg 或其他)获取,如下所示:

到目前为止我尝试的是以下代码,但我愿意以其他方式解决此问题,所以如果您有其他方式,那很好。 此代码是我在此处 How to extract text and text coordinates from a PDF file? 找到的解决方案(第一个答案)的修改版本; (只有我的代码的第一部分);第二部分是我尝试过但到目前为止没有成功的方法,我也尝试用 pymupdf 读取图像但根本没有改变任何东西(我不会 post 这种尝试,因为 post足够大)。

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer
import os
import pandas as pd
import pdf2image
import numpy as np
import PIL
from PIL import Image
import io

# pdf path 
pdf_path ="example - Git From Bottom Up.pdf"

# PART 1: GET LTBOXES COORDINATES IN THE IMAGE
# Open a PDF file.
fp = open(pdf_path, 'rb')

# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)

# Create a PDF document object that stores the document structure.
# Password for initialization as 2nd parameter
document = PDFDocument(parser)

# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed

# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()

# Create a PDF device object.
device = PDFDevice(rsrcmgr)

# BEGIN LAYOUT ANALYSIS
# Set parameters for analysis.
laparams = LAParams()

# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)

# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)


# here is where i stored the data
boxes_data = []
page_sizes = []

def parse_obj(lt_objs, verbose = 0):
    # loop over the object list
    for obj in lt_objs:
        # if it's a textbox, print text and location
        if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
            if verbose >0:
                print("%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text()))
            data_dict = {"startX":round(obj.bbox[0]),"startY":round(obj.bbox[1]),"endX":round(obj.bbox[2]),"endY":round(obj.bbox[3]),"text":obj.get_text()}
            boxes_data.append(data_dict)
        # if it's a container, recurse
        elif isinstance(obj, pdfminer.layout.LTFigure):
            parse_obj(obj._objs)

# loop over all pages in the document
for page in PDFPage.create_pages(document):
    # read the page into a layout object
    interpreter.process_page(page)
    layout = device.get_result()
    # extract text from this object
    parse_obj(layout._objs)
    mediabox = page.mediabox
    mediabox_data = {"height":mediabox[-1], "width":mediabox[-2]}
    page_sizes.append(mediabox_data)

代码的第二部分,获取图像格式的裁剪框。

# PART 2: NOW GET PAGE TO IMAGE
firstpage_size = page_sizes[0]
firstpage_image = pdf2image.convert_from_path(pdf_path,size=(firstpage_size["height"],firstpage_size["width"]))[0]
#show first page with the right size (at least the one that pdfminer says)
firstpage_image.show()

#first box data
startX,startY,endX,endY,text = boxes_data[0].values()
# turn image to array
image_array = np.array(firstpage_image)
# get cropped box
box = image_array[startY:endY,startX:endX,:]
convert2pil_image = PIL.Image.fromarray(box)
#show cropped box image
convert2pil_image.show()
#print this does not match with the text, means there's an error
print(text)

如您所见,框的坐标与图像不匹配,问题可能是因为 pdf2image 对图像大小或类似的东西做了一些技巧,但我正确指定了图像的大小,所以我不知道。 任何解决方案/建议都非常受欢迎。 提前致谢。

我检查了您代码第一部分中前两个框的坐标,它们或多或少适合页面上的文本:

但是您知道 PDF 中的零点位于 bottom-left 角吗?也许这是问题的原因。

不幸的是我没能测试代码的第二部分。 pdf2image 给我一些错误。

但我几乎可以肯定 PIL.Image 在 top-left 角中的点为零,这与 PDF 不同。您可以使用公式将 pdf_Y 转换为 pil_Y:

pil_Y = page_height - pdf_Y

您的页面高度为 792 磅。您也可以通过脚本获取页面高度。

坐标


更新

尽管如此,我花了几个小时来安装所有模块(这是最难的部分!)我让你的脚本在一定程度上工作。

基本上我是对的:坐标倒置了y => h - y因为PIL和PDF的零点位置不同

还有一件事。 PIL 制作分辨率为 200 dpi 的图像(可能它可以在某处更改)。 PDF 以点为单位衡量一切(1 pt = 1/72 dpi)。所以如果你想在 PIL 中使用 PDF 大小,你需要这样更改 PDF 大小:x => x * 200 / 72.

固定代码如下:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer
import os
import pandas as pd
import pdf2image
import numpy as np
import PIL
from PIL import Image
import io
from pathlib import Path # it's just my favorite way to handle files

# pdf path
# pdf_path ="test.pdf"
pdf_path = Path.cwd()/"Git From Bottom Up.pdf"


# PART 1: GET LTBOXES COORDINATES IN THE IMAGE ----------------------
# Open a PDF file.
fp = open(pdf_path, 'rb')

# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)

# Create a PDF document object that stores the document structure.
# Password for initialization as 2nd parameter
document = PDFDocument(parser)

# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed

# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()

# Create a PDF device object.
device = PDFDevice(rsrcmgr)

# BEGIN LAYOUT ANALYSIS
# Set parameters for analysis.
laparams = LAParams()

# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)

# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)


# here is where i stored the data
boxes_data = []
page_sizes = []

def parse_obj(lt_objs, verbose = 0):
    # loop over the object list
    for obj in lt_objs:
        # if it's a textbox, print text and location
        if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
            if verbose >0:
                print("%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text()))
            data_dict = {
                "startX":round(obj.bbox[0]),"startY":round(obj.bbox[1]),
                "endX":round(obj.bbox[2]),"endY":round(obj.bbox[3]),
                "text":obj.get_text()}
            boxes_data.append(data_dict)
        # if it's a container, recurse
        elif isinstance(obj, pdfminer.layout.LTFigure):
            parse_obj(obj._objs)

# loop over all pages in the document
for page in PDFPage.create_pages(document):
    # read the page into a layout object
    interpreter.process_page(page)
    layout = device.get_result()
    # extract text from this object
    parse_obj(layout._objs)
    mediabox = page.mediabox
    mediabox_data = {"height":mediabox[-1], "width":mediabox[-2]}
    page_sizes.append(mediabox_data)

# PART 2: NOW GET PAGE TO IMAGE -------------------------------------
firstpage_size = page_sizes[0]
firstpage_image = pdf2image.convert_from_path(pdf_path)[0] # without 'size=...'
#show first page with the right size (at least the one that pdfminer says)
# firstpage_image.show()
firstpage_image.save("firstpage.png")

# the magic numbers
dpi = 200/72
vertical_shift = 5 # I don't know, but it's need to shift a bit
page_height = int(firstpage_size["height"] * dpi)

# loop through boxes (we'll process only first page for now)
for i, _ in enumerate(boxes_data):

    #first box data
    startX, startY, endX, endY, text = boxes_data[i].values()

    # correction PDF --> PIL
    startY = page_height - int(startY * dpi) - vertical_shift
    endY   = page_height - int(endY   * dpi) - vertical_shift
    startX = int(startX * dpi)
    endX   = int(endX   * dpi)
    startY, endY = endY, startY 

    # turn image to array
    image_array = np.array(firstpage_image)
    # get cropped box
    box = image_array[startY:endY,startX:endX,:]
    convert2pil_image = PIL.Image.fromarray(box)
    #show cropped box image
    # convert2pil_image.show()
    png = "crop_" + str(i) + ".png"
    convert2pil_image.save(png)
    #print this does not match with the text, means there's an error
    print(text)

代码和你的差不多。我只是添加了坐标校正并保存了 PNG 文件而不是显示它们。

输出:

Gi from the bottom up

Wed,  Dec 9

by John Wiegley

In my pursuit to understand Git, it’s been helpful for me to understand it from the bottom
up — rather than look at it only in terms of its high-level commands. And since Git is so beauti-
fully simple when viewed this way, I thought others might be interested to read what I’ve found,
and perhaps avoid the pain I went through nding it.

I used Git version 1.5.4.5 for each of the examples found in this document.

1.  License
2.  Introduction
3.  Repository: Directory content tracking

Introducing the blob
Blobs are stored in trees
How trees are made
e beauty of commits
A commit by any other name…
Branching and the power of rebase
4.  e Index: Meet the middle man

Taking the index farther
5.  To reset, or not to reset

Doing a mixed reset
Doing a so reset
Doing a hard reset

6.  Last links in the chain: Stashing and the reog
7.  Conclusion
8.  Further reading

2
3
5
6
7
8
10
12
15
20
22
24
24
24
25
27
30
31

当然固定代码更像是原型。不作为产品销售。 )