How to properly use the returned PageObject to call extractText() with PyPDF2
I am trying to use PyPDF2 with Python 3 to search a given file for a keyword. The function searchFromFile(path: str, keyword: str) -> List[PageObject] is as follows:
def searchFromFile(path:str,keyword:str) -> List[PageObject]:
    pdf = pypdf.PdfFileReader(open(path, "rb"))
    if pdf.isEncrypted:
        pdf.decrypt('')
    numberOfPages = pdf.getNumPages()
    result = [PageObject]
    for pageNumber in range(0,numberOfPages):
        page = pdf.getPage(pageNumber)
        text = page.extractText()
        if keyword in text:
            result.append(page)
    return result

if __name__ == '__main__':
    resultList = searchFromFile(sys.argv[1], sys.argv[2])
    for page in resultList:
        print("page content:",page.extractText())
The return type is a list of PageObject, so I expected to be able to call PageObject methods in main, as in the code above. Instead, I get the following error:
Traceback (most recent call last):
  File "C:\Users\...\git\python\venv\pdftool\SearchFromPdf.py", line 28, in <module>
    print("page content:",page.extractText())
TypeError: extractText() missing 1 required positional argument: 'self'
Question: how can I fix this error?
Can you confirm that this code works?
# sudo apt-get install python3-pypdf2
import sys

import PyPDF2 as pypdf

def searchFromFile(path:str,keyword:str):
    pdf = pypdf.PdfFileReader(open(path, "rb"))
    if pdf.isEncrypted:
        pdf.decrypt('')
    numberOfPages = pdf.getNumPages()
    # result = [PageObject]  # put the PageObject class itself into the list
    result = []  # start from an empty list instead
    for pageNumber in range(0,numberOfPages):
        print("page",pageNumber,"/",numberOfPages)
        page = pdf.getPage(pageNumber)
        text = page.extractText()
        if keyword in text:
            result.append(page)
    return result

if __name__ == '__main__':
    resultList = searchFromFile(sys.argv[1], sys.argv[2])
    for page in resultList:
        print("page content:",page.extractText())
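The only change that matters is the initial value of result. As a rough illustration of why, here is a minimal sketch that does not need PyPDF2 at all. The Page class and the describe() helper are hypothetical stand-ins invented for this example, not part of the library. Writing [Page] builds a list whose first element is the class itself, so the first loop iteration calls extractText() on the class with no instance to bind as self. The type annotation is what should carry the type, while the value should be an empty list.

```python
from typing import List


class Page:
    """Hypothetical stand-in for PyPDF2's PageObject (assumption for this sketch)."""

    def __init__(self, text: str) -> None:
        self.text = text

    def extractText(self) -> str:
        return self.text


def describe(items: list) -> List[str]:
    """Report, for each element, whether it is a class or an instance."""
    return ["class" if isinstance(item, type) else "instance" for item in items]


# Buggy version: the list starts out containing the class object itself.
buggy = [Page]
buggy.append(Page("hello"))

# Fixed version: the list starts empty; the annotation documents the element type.
fixed: List[Page] = []
fixed.append(Page("hello"))

# Calling the method on the class (element 0 of the buggy list) has no 'self'.
try:
    buggy[0].extractText()
    error = None
except TypeError as exc:
    error = str(exc)

print(describe(buggy))
print(describe(fixed))
print(error)
```

If the annotated return type is wanted in the real function, it can stay in the signature as List[PageObject], provided List is imported from typing and PageObject is imported from the PyPDF2 module that defines it.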