使用 pytesseract 执行 OCR 时出错

Question

我想使用 pytesseract。这是我的代码。

import pytesseract 
from pdf2image import convert_from_path 

PDF_file = 'file.pdf'
text = '' 
pages = convert_from_path(PDF_file, 500)
pageText = str(((pytesseract.image_to_string(pages[0]))))

结果我得到这个错误

Traceback (most recent call last): File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pdf2image\pdf2image.py", line 409, in pdfinfo_from_path proc = Popen(command, env=env, stdout=PIPE, stderr=PIPE) File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\subprocess.py", line 854, in init self._execute_child(args, executable, preexec_fn, close_fds, File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\subprocess.py", line 1307, in _execute_child hp, ht, pid, tid = _winapi.CreateProcess(executable, args, FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "C:\Users\user\Desktop\projects\pdfparser\pdftest.py", line 13, in pages = convert_from_path(PDF_file, 500) File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pdf2image\pdf2image.py", line 89, in convert_from_path page_count = pdfinfo_from_path(pdf_path, userpw, poppler_path=poppler_path)["Pages"] File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pdf2image\pdf2image.py", line 430, in pdfinfo_from_path raise PDFInfoNotInstalledError( pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

Answer 1

正如很多评论已经指出的那样，错误信息

PDFInfoNotInstalledError( pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

准确告诉您出了什么问题：未安装 Poppler。请参阅 README 以获取这方面的帮助。

你看，pdf2image 只是 pdftoppm 命令行实用程序的包装器。在 Linux 上它是默认安装的，所以你不需要费心去安装它，但在 Windows 上它不是。

使用 pytesseract 执行 OCR 时出错

Error while performing OCR using pytesseract

python

ocr

python-3.x

python-tesseract