How to properly call extractText() on the PageObject returned by PyPDF2

I am trying to use PyPDF2 with Python 3 to search for a keyword in a given file. The function searchFromFile(path:str,keyword:str) -> List[PageObject] is as follows:

def searchFromFile(path:str,keyword:str) -> List[PageObject]:
  pdf = pypdf.PdfFileReader(open(path, "rb"))
  if pdf.isEncrypted:
    pdf.decrypt('')
  numberOfPages = pdf.getNumPages()
  result = [PageObject]
  for pageNumber in range(0,numberOfPages):
    page = pdf.getPage(pageNumber)
    text = page.extractText()
    if keyword in text:
        result.append(page)
  return result

if __name__ == '__main__':
  resultList = searchFromFile(sys.argv[1], sys.argv[2])
  for page in resultList:
    print("page content:",page.extractText())

The return type is a list of PageObject, so I expected to be able to call PageObject methods in main, as in the code above. Instead, I get the following error:

Traceback (most recent call last):
  File "C:\Users\...\git\python\venv\pdftool\SearchFromPdf.py", line 28, in <module>
    print("page content:",page.extractText())
TypeError: extractText() missing 1 required positional argument: 'self'

Question: how do I fix this error?

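Can you confirm that this code works? In your version, result = [PageObject] puts the PageObject class itself into the list as its first element. The first iteration of the loop in main therefore calls extractText() on the class rather than on a page instance, which is why 'self' is reported as missing. Starting from an empty list avoids this: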

# sudo apt-get install python3-pypdf2

import sys

import PyPDF2 as pypdf

def searchFromFile(path:str,keyword:str):
    # PdfFileReader, isEncrypted, getNumPages, getPage and extractText are the
    # legacy PyPDF2 names used in the question.
    pdf = pypdf.PdfFileReader(open(path, "rb"))
    if pdf.isEncrypted:
        pdf.decrypt('')
    numberOfPages = pdf.getNumPages()
    # ~ result = [PageObject]  # bug: puts the PageObject class itself in the list
    result = []                # fix: start with an empty list
    for pageNumber in range(0,numberOfPages):
        print("page",pageNumber,"/",numberOfPages)
        page = pdf.getPage(pageNumber)
        text = page.extractText()
        if keyword in text:
            result.append(page)
    return result

if __name__ == '__main__':
    # Usage: python SearchFromPdf.py <path-to-pdf> <keyword>
    resultList = searchFromFile(sys.argv[1], sys.argv[2])
    for page in resultList:
        print("page content:",page.extractText())
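
The error itself can be reproduced without PyPDF2 or a PDF file. The sketch below is only an illustration: Page is a hypothetical stand-in for PageObject, and search_buggy / search_fixed are made-up names that mirror the two versions of searchFromFile. It shows that a list initialised as [Page] holds the class as element 0, and that the element type belongs in a type annotation instead.

```python
from typing import List


class Page:
    """Hypothetical stand-in for PyPDF2's PageObject (not the real class)."""

    def __init__(self, text: str) -> None:
        self.text = text

    def extractText(self) -> str:
        return self.text


def search_buggy(pages: List[Page], keyword: str) -> list:
    # Bug: the list starts with the class Page itself as element 0.
    result = [Page]
    for page in pages:
        if keyword in page.extractText():
            result.append(page)
    return result


def search_fixed(pages: List[Page], keyword: str) -> List[Page]:
    # Fix: start empty; the element type goes in the annotation, not in the list.
    result: List[Page] = []
    for page in pages:
        if keyword in page.extractText():
            result.append(page)
    return result


pages = [Page("alpha beta"), Page("gamma"), Page("beta delta")]

buggy = search_buggy(pages, "beta")
error_message = ""
try:
    # Called on the class, so no instance is bound to self.
    buggy[0].extractText()
except TypeError as exc:
    error_message = str(exc)

fixed_texts = [page.extractText() for page in search_fixed(pages, "beta")]
print(error_message)
print(fixed_texts)
```

The exact wording of the TypeError message varies slightly between Python versions (newer ones prefix the method name with the class name), but in each case it reports the missing 'self' argument.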