Highlight text content in PDF files using Python and save a screenshot

Question

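I have a list of PDF files, and I need to highlight specific text on each page of these files and save a snapshot for each instance of the text.

So far I can highlight the text and save the entire page of the PDF file as a snapshot. However, I want to find the position of the highlighted text and take a zoomed-in snapshot, which would be more detailed than a full-page snapshot.

I'm pretty sure there must be a solution to this problem. I am new to Python, so I have not been able to find it. I would really appreciate it if someone could help me with this.

I have tried the PyPDF2 and PyMuPDF libraries, but I couldn't figure out a solution. I also tried highlighting by providing valid coordinates, but couldn't find a way to get those coordinates as output.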

[Image: sample snapshot from the code]

#import PyPDF2
import os
import fitz
from wand.image import Image
import csv
#import re
#from pdf2image import convert_from_path

check = r'C:\Users\Pradyumna.M\Desktop\Pradyumna\Automation\Intel Bytes\Create Source Docs\Sample Check 8 Apr 2019'

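# Build paths with os.path.join; a literal such as '\Output\' is a syntax
# error because the trailing backslash escapes the closing quote.
dir1 = os.path.join(check, 'Source Docs')
dir2 = os.path.join(check, 'Output')

for x in (dir1, dir2):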
    try:
        os.mkdir(x)
    except FileExistsError:
        print("Directory ", x, " already exists")

### READ PDF FILE
with open('upload1.csv', newline='') as myfile:
    reader = csv.reader(myfile)
    for row in reader:
        rowarray = '; '.join(row)
        src = rowarray.split("; ")
        file = os.path.join(check, src[4] + '.pdf')
        print(file)
        #pdfFileObj = open(file,'rb')
        #pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        #print("Total number of pages: " + str(pdfReader.numPages))
        doc = fitz.open(file)
        print(src[5])
        for i in range(int(src[5])-1, int(src[5])):
            i = int(i)
            page = doc[i]
            print("Processing page: " + str(i))
            text = src[3]
            #SEARCH TEXT
            print("Searching: " + text)
            text_instances = page.searchFor(text)
            for inst in text_instances:
                highlight = page.addHighlightAnnot(inst)
                file1 = os.path.join(check, 'Output', src[4] + '_output.pdf')
                print(file1)
                doc.save(file1, garbage=4, deflate=True, clean=True)
                ### Screenshot
                with(Image(filename=file1, resolution=150)) as source:
                    images = source.sequence
                    newfilename = os.path.join(check, 'Source Docs', src[0] + '.jpeg')
                    Image(images[i]).save(filename=newfilename)
                    print("Screenshot of " + src[0] + " saved")

Answer 1

"couldn't find a way to get those coordinates as output" - you can get the coordinates by doing this:

for inst in text_instances:
    print(inst)

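inst is a fitz.Rect object, which contains the top-left and bottom-right coordinates of the text fragment that was found. All the information is available in the docs.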
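As a small illustration of reading those coordinates: the PyMuPDF documentation describes fitz.Rect as sequence-like, so it should unpack into (x0, y0, x1, y1). The helper name describe_rect below is hypothetical, and a plain tuple stands in for a real search hit so the sketch runs without a PDF.

```python
def describe_rect(rect):
    """Return the corner coordinates and size of a 4-item rectangle.

    `rect` may be any sequence of (x0, y0, x1, y1); a fitz.Rect is
    assumed to unpack the same way because it is sequence-like.
    """
    x0, y0, x1, y1 = rect
    return {
        "top_left": (x0, y0),
        "bottom_right": (x1, y1),
        "width": x1 - x0,
        "height": y1 - y0,
    }


# A plain tuple standing in for one entry of text_instances.
info = describe_rect((72.0, 100.0, 110.0, 112.0))
print(info["top_left"], info["bottom_right"])
```

Inside the search loop, the same helper could be called as describe_rect(inst) for each match.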

I managed to highlight the matches and save a cropped region using the following snippet. I am using Python 3.7.1, and the output of fitz.version is ('1.14.13', '1.14.0', '20190407064320').

import fitz

doc = fitz.open("foo.pdf")
inst_counter = 0
for pi in range(doc.pageCount):
    page = doc[pi]

    text = "hello"
    text_instances = page.searchFor(text)

    five_percent_height = (page.rect.br.y - page.rect.tl.y)*0.05

    for inst in text_instances:
        inst_counter += 1
        highlight = page.addHighlightAnnot(inst)

        # define a cropping box that spans the full page width
        # and adds vertical padding around the highlighted text
        tl_pt = fitz.Point(page.rect.tl.x, max(page.rect.tl.y, inst.tl.y - five_percent_height))
        br_pt = fitz.Point(page.rect.br.x, min(page.rect.br.y, inst.br.y + five_percent_height))
        hl_clip = fitz.Rect(tl_pt, br_pt)

        zoom_mat = fitz.Matrix(2, 2)
        pix = page.getPixmap(matrix=zoom_mat, clip = hl_clip)
        pix.writePNG(f"pg{pi}-hl{inst_counter}.png")

doc.close()
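The snippet above crops a full-width strip around each match. If a tighter crop is wanted, the clip can be padded on all four sides and clamped to the page, as in the rough sketch below. The names padded_clip, zoom_for_dpi and save_highlight_crops are hypothetical helpers, and the 5% padding and 144 dpi defaults are arbitrary choices. The PyMuPDF calls use the snake_case names (page_count, search_for, add_highlight_annot, get_pixmap, Pixmap.save) from later PyMuPDF releases rather than the 1.14 names used above, so this is a sketch under those assumptions and not a drop-in replacement.

```python
def padded_clip(page_rect, hit_rect, pad_fraction=0.05):
    """Pad hit_rect on all four sides, clamped to page_rect.

    Both rectangles are (x0, y0, x1, y1) sequences; the padding is a
    fraction of the page width (horizontally) and height (vertically).
    """
    px0, py0, px1, py1 = page_rect
    hx0, hy0, hx1, hy1 = hit_rect
    pad_x = (px1 - px0) * pad_fraction
    pad_y = (py1 - py0) * pad_fraction
    return (
        max(px0, hx0 - pad_x),
        max(py0, hy0 - pad_y),
        min(px1, hx1 + pad_x),
        min(py1, hy1 + pad_y),
    )


def zoom_for_dpi(dpi):
    """PDF coordinates are in points (1/72 inch), so zoom = dpi / 72."""
    return dpi / 72.0


def save_highlight_crops(pdf_path, text, dpi=144):
    """Highlight every match of `text` and save one cropped PNG per match."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    zoom = zoom_for_dpi(dpi)
    saved = []
    doc = fitz.open(pdf_path)
    try:
        for page_number in range(doc.page_count):
            page = doc[page_number]
            for hit_number, hit in enumerate(page.search_for(text), start=1):
                page.add_highlight_annot(hit)
                clip = fitz.Rect(*padded_clip(page.rect, hit))
                pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=clip)
                name = f"pg{page_number}-hl{hit_number}.png"
                pix.save(name)
                saved.append(name)
    finally:
        doc.close()
    return saved
```

A call such as save_highlight_crops("foo.pdf", "hello") would then write one PNG per match into the working directory. Keeping the rectangle arithmetic in padded_clip, separate from the rendering, makes the cropping rule easy to check without opening a PDF.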

I tested this on a sample PDF with "hello" interspersed throughout:

[Image: sample PDF]

Some output of the script:

[Image: script output]

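I wrote the solution based on the following pages of the documentation:

Tutorial page, for an introduction to the library
page.searchFor, to find out the return type of the searchFor method
fitz.Rect, to understand what the objects returned by page.searchFor are
Collection of Recipes page (called FAQ in the URL), to learn how to crop and save part of a PDF page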
