How to extract multiple instances of a word from PDF files in Python?

Question

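I'm writing a Python script that reads a PDF file and records the string that follows each mention of "time", along with the page number where it is mentioned.

I've got it to recognise when the string "time" appears on a page and report the page number, but if "time" appears more than once on that page, it doesn't tell me. I assume this is because the check is already satisfied by the first occurrence of "time", so it moves on to the next page.

How would I find multiple instances of the word "time"?

Here is my code: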

import PyPDF2

def pdf_read():
    pdfFile = r"records\document.pdf"  # raw string keeps the backslash literal

    pdf = PyPDF2.PdfFileReader(pdfFile)
    pageCount = pdf.getNumPages()

    for pageNumber in range(pageCount):
        page = pdf.getPage(pageNumber)
        pageContent = page.extractText()
        if "Time" in pageContent or "time" in pageContent:
            print(pageNumber)

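Also note that this PDF is a scanned document, so when I read the text in Python (or copy and paste it into Word), many words come out with random symbols and characters mixed in, even though the scan itself is perfectly legible. Is this a limitation of computer programming, meaning the file cannot be read accurately without more complex concepts such as machine learning?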

Answer 1

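One solution is to build a list of strings from pageContent and count how often the word 'time' occurs in that list. Selecting the word after 'time' also becomes easier: you can simply retrieve the next item in the list: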

import PyPDF2
import string

pdfFile = r"records\document.pdf"  # raw string keeps the backslash literal

# Legacy PyPDF2 names; newer releases call these PdfReader, reader.pages
# and page.extract_text().
pdf = PyPDF2.PdfFileReader(pdfFile)
pageCount = pdf.getNumPages()

for pageNumber in range(pageCount):
    page = pdf.getPage(pageNumber)
    pageContent = page.extractText()
    words = pageContent.split()  # split on any whitespace, including line breaks
    # lower-case every word and remove punctuation
    words = ["".join(c.lower() for c in w if c not in string.punctuation) for w in words]
    words = [w for w in words if w]  # drop words that were only punctuation

    print(pageNumber, words.count('time'))  # occurrences of 'time' on this page
    # pair each 'time' with the word that follows it ('' at the end of the page)
    print([(w, words[i + 1] if i + 1 < len(words) else '') for i, w in enumerate(words) if w == 'time'])

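Note that this example also strips ASCII punctuation (the characters in string.punctuation) from every word. Hopefully that is enough to clean up the bad OCR.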
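A regex-based variation on the same idea, as a minimal sketch: it tokenizes each page into runs of letters and digits, then records the page number and the following word for every occurrence of the target. The helper name `find_word_with_next` and the sample page texts are made up for illustration, and the function takes plain strings (one per page, for example the result of extractText() for each page) so it can be tried without a PDF. One difference to be aware of: a stray symbol in the middle of a word splits that word in two here, rather than being removed.

```python
import re

# Hypothetical helper, not part of PyPDF2: it works on plain page text.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def find_word_with_next(pages, target="time"):
    """Return (page_number, following_word) for every occurrence of target.

    pages is a list of page texts, one string per page.
    Page numbers are zero-based, matching range(pageCount).
    """
    target = target.lower()
    hits = []
    for page_number, text in enumerate(pages):
        # Keep only runs of letters/digits, so stray symbols act as separators.
        words = [w.lower() for w in WORD_RE.findall(text)]
        for i, word in enumerate(words):
            if word == target:
                # '' when the target is the last word on the page
                following = words[i + 1] if i + 1 < len(words) else ""
                hits.append((page_number, following))
    return hits


# Made-up sample text standing in for extracted pages.
sample_pages = [
    "Time: 10 am.\nNo other mention here.",
    "Nothing relevant on this page.",
    "Start time 0900, end TIME 1700; total time",
]
for page_number, following in find_word_with_next(sample_pages):
    print(page_number, repr(following))
```

Because every hit carries its own page number, a page that mentions the word three times produces three entries instead of one.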

python

pdf

pdf-reader

python-3.x