Removing pages from a PDF file based on a condition using Python
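I have a PDF file of about 1000 pages, and I want to delete some of its pages on the condition that a specific word is not found on them. For example, the code would search each page for a word such as "STACKOVER"; if the word is not found on the page, it deletes that page and moves on to the next one, and at the end it saves the file.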
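One way to do this: first, define the search words you want to look for (in my case I tested on a medical journal and searched for searchwords=['unclear risk for poorly']). Second, find all pages that contain that word or string and store their page numbers in a list (pages_to_delete). To be safe, I also write them to a CSV file that records the pages on which each search word was found. Third, open the original PDF, drop the pages contained in the list, and save the result as a new PDF. Note that, as written, this removes the pages that contain the search word; to keep only those pages instead, change the test in the final loop from "if i not in pages_to_delete" to "if i in pages_to_delete".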
import re
from PyPDF2 import PdfFileWriter, PdfFileReader  # these class names require PyPDF2 < 3.0

pdfFileObj=open(r'C:\Users\s-degossondevarennes\......\dddtest.pdf',mode='rb')
pdfReader=PdfFileReader(pdfFileObj)
number_of_pages=pdfReader.numPages

# Extract the text of every page once, before searching.
pages_text=[pdfReader.getPage(page).extractText() for page in range(number_of_pages)]
pdfFileObj.close()

words_start_pos={}
words={}
searchwords=['unclear risk for poorly']  # lowercase, because the page text is lowercased below
pages_to_delete = []  # 0-based indices of the pages that contain a search word

with open('Pages.csv', 'w') as f:
    f.write('{0},{1}\n'.format("Sheet Number", "Search Word"))
    for word in searchwords:
        for page in range(number_of_pages):
            print(page)
            # word is treated as a regular expression; wrap it in re.escape() to match it literally.
            words_start_pos[page]=[dwg.start() for dwg in re.finditer(word, pages_text[page].lower())]
            words[page]=[pages_text[page][value:value+len(word)] for value in words_start_pos[page]]
        for page in words:
            for i in range(0,len(words[page])):
                f.write('{0},{1}\n'.format(page+1, words[page][i]))  # 1-based page number in the CSV
                if page not in pages_to_delete:
                    pages_to_delete.append(page)

# Copy every page that is not in pages_to_delete into a new PDF.
infile = PdfFileReader(r'C:\Users\s-degossondevarennes\.......\dddtest.pdf')
output = PdfFileWriter()
for i in range(infile.getNumPages()):
    if i not in pages_to_delete:
        p = infile.getPage(i)
        output.addPage(p)
with open('Newdddtest.pdf', 'wb') as f:
    output.write(f)
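For the case described in the question (drop every page on which the word is NOT found), the selection can be inverted. The following is only a minimal sketch under stated assumptions, not the code from this answer: it assumes the newer pypdf package (the successor to PyPDF2) is installed, and the helper names pages_to_keep and keep_pages_with_term are hypothetical names chosen for illustration. The page-selection logic is kept in a separate function that works on plain strings, so it can be checked without a PDF.

```python
import re


def pages_to_keep(page_texts, term, ignore_case=True):
    """Return the 0-based indices of the pages whose text contains `term`."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape makes the term match literally instead of as a regex.
    pattern = re.compile(re.escape(term), flags)
    return [i for i, text in enumerate(page_texts) if pattern.search(text or "")]


def keep_pages_with_term(src_path, dst_path, term):
    """Write to dst_path only the pages of src_path that contain `term`."""
    # Assumption: pypdf is installed. It is imported here so that
    # pages_to_keep() above can be used without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    texts = [page.extract_text() for page in reader.pages]

    writer = PdfWriter()
    for i in pages_to_keep(texts, term):
        writer.add_page(reader.pages[i])

    with open(dst_path, "wb") as out:
        writer.write(out)
```

It could be called as keep_pages_with_term('dddtest.pdf', 'Newdddtest.pdf', 'STACKOVER'). Pages whose text cannot be extracted (for example scanned images) yield empty text and are therefore dropped, which may or may not be what you want.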
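Update
extractText() returns plain text, so whether the text is bold has no effect on the search. What the .lower() call controls is case: with it, the match ignores case (as long as the search word itself is lowercase). If you want the match to be case-sensitive instead, replace
words_start_pos[page]=[dwg.start() for dwg in re.finditer(word, pages_text[page].lower())]
with
words_start_pos[page]=[dwg.start() for dwg in re.finditer(word, pages_text[page])]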
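The difference between the two variants can be seen on a plain string. This is a small sketch rather than part of the answer's code: match_positions is a hypothetical helper, and it uses re.IGNORECASE instead of lowercasing the page text, so the search word may be written in any case. It also escapes the word so that characters such as "+" or "(" are matched literally.

```python
import re


def match_positions(text, word, ignore_case=True):
    """Return the start offsets of every literal occurrence of `word` in `text`."""
    flags = re.IGNORECASE if ignore_case else 0
    return [m.start() for m in re.finditer(re.escape(word), text, flags)]


sample = "Risk A. risk B."
print(match_positions(sample, "risk"))                     # [0, 8]
print(match_positions(sample, "risk", ignore_case=False))  # [8]
```

With ignore_case=True both "Risk" and "risk" are found; with ignore_case=False only the exact spelling is.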