Deleting pages from a PDF file with Python

Removing pages from a PDF file based on a condition, using Python

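I have a PDF file of about 1000 pages, and I want to delete the pages on which a specific word cannot be found. For example, the code would search for a specific word such as "STACKOVER"; if the word is not found on a page, the page is deleted and the code moves on to the next page. At the end, the file is saved.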

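The way to do this is as follows. First, define the search words you want to look for (in my case, I tested on a medical journal and searched for searchwords=['unclear risk for poorly']). Second, find all pages that contain the word or string and store their page numbers in a list (pages_to_delete). To be safe, I also write them to a csv file that records the pages on which each search word was found. Third, open the original pdf, delete the pages contained in the list, and save the result as a new pdf.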

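# Written for PyPDF2 before 3.0; PdfFileReader, PdfFileWriter, getPage and
# extractText were removed in PyPDF2 3.0 in favour of PdfReader/PdfWriter.
import PyPDF2
import re
from PyPDF2 import PdfFileWriter, PdfFileReader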

pdfFileObj=open(r'C:\Users\s-degossondevarennes\......\dddtest.pdf',mode='rb')
pdfReader=PyPDF2.PdfFileReader(pdfFileObj)
number_of_pages=pdfReader.numPages

pages_text=[]
words_start_pos={}
words={}

searchwords=['unclear risk for poorly']

pages_to_delete = []

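with open('Pages.csv', 'w') as f:
    f.write('{0},{1}\n'.format("Sheet Number", "Search Word"))
    # Extract the text of every page once, up front
    for page in range(number_of_pages):
        print(page)  # progress indicator
        pages_text.append(pdfReader.getPage(page).extractText())
    for word in searchwords:
        for page in range(number_of_pages):
            # Search words must be lowercase because the page text is lowercased here;
            # re.escape makes the word match literally rather than as a regex
            words_start_pos[page]=[dwg.start() for dwg in re.finditer(re.escape(word), pages_text[page].lower())]
            words[page]=[pages_text[page][value:value+len(word)] for value in words_start_pos[page]]
        for page in words:
            for match in words[page]:
                # CSV page numbers are 1-based; pages_to_delete holds 0-based indices
                f.write('{0},{1}\n'.format(page+1, match))
                if page not in pages_to_delete:
                    pages_to_delete.append(page)

pdfFileObj.close()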

# PdfFileReader accepts a path directly; no file mode is needed
infile = PdfFileReader(r'C:\Users\s-degossondevarennes\.......\dddtest.pdf')
output = PdfFileWriter()

for i in range(infile.getNumPages()):
    if i not in pages_to_delete:
        p = infile.getPage(i)
        output.addPage(p)

with open('Newdddtest.pdf', 'wb') as f:
    output.write(f)
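
Note that the code above drops the pages that contain the search term, whereas the question asks for the opposite: drop the pages on which the term is not found. A minimal sketch of that variant follows. It is not part of the original answer, and it assumes the newer PdfReader/PdfWriter API of the pypdf package (the successor to PyPDF2) instead of the calls used above. The helper names pages_containing and keep_matching_pages are made up for illustration.

```python
def pages_containing(page_texts, term):
    """Return the 0-based indices of the pages whose text contains `term`.

    Plain substring search, ignoring case.
    """
    needle = term.casefold()
    return [i for i, text in enumerate(page_texts) if needle in text.casefold()]


def keep_matching_pages(src_path, dst_path, term):
    """Write to dst_path only those pages of src_path that contain `term`."""
    # Third-party dependency (pip install pypdf), imported here so that
    # pages_containing() can be used on plain strings without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    # Fall back to "" in case a page (e.g. a scanned image) yields no text.
    page_texts = [page.extract_text() or "" for page in reader.pages]
    keep = pages_containing(page_texts, term)

    writer = PdfWriter()
    for i in keep:
        writer.add_page(reader.pages[i])
    with open(dst_path, "wb") as f:
        writer.write(f)
    return keep
```

A call would look like keep_matching_pages("input.pdf", "output.pdf", "STACKOVER"), where both file names are placeholders. Text extraction does not always preserve spacing or line breaks, so a multi-word phrase may fail to match on pages where the PDF splits it across lines; checking the returned page list before trusting the output is probably wise.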

Update

The code above lowercases the page text before matching, so a lowercase search word matches regardless of capitalisation in the PDF. If you want the search to be case-sensitive, replace

pages_text[page].lower()

with

pages_text[page]

in the line that builds words_start_pos[page].
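
As an alternative to editing that line by hand, the case handling can be made a parameter. The following is only a sketch under that assumption, and find_matches is a hypothetical helper, not something from the original answer. It uses re.IGNORECASE instead of lowercasing the text, so the matched snippet keeps the capitalisation it has in the PDF, and re.escape so that the search word is matched literally.

```python
import re


def find_matches(text, word, case_sensitive=False):
    """Return (start, matched_text) pairs for literal occurrences of `word`."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # re.escape: characters such as "." or "(" in the word are not treated as regex syntax.
    pattern = re.compile(re.escape(word), flags)
    return [(m.start(), m.group()) for m in pattern.finditer(text)]
```

With this helper, the search word no longer has to be lowercase, and a page counts as a hit whenever find_matches returns a non-empty list for its text.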