How to convert a scanned PDF to text using Wand in Python
When converting a scanned PDF to text using Wand and ImageMagick, I get the following error:
Error:
Traceback (most recent call last):
File "C:/Users/gibin/PycharmProjects/ML/Image_PDF/.ksldwjldf.py", line 28, in <module>
Get_text_from_image(r"C:\Users\gibin\PycharmProjects\ML\Image_PDF6676972_image.pdf")
File "C:/Users/gibin/PycharmProjects/ML/Image_PDF/.ksldwjldf.py", line 13, in Get_text_from_image
pdf=wi(filename=pdf_path,resolution=300)
File "C:\Users\gibin\AppData\Local\Programs\Python\Python37-32\lib\site-packages\wand\image.py", line 8212, in __init__
units=units)
File "C:\Users\gibin\AppData\Local\Programs\Python\Python37-32\lib\site-packages\wand\image.py", line 8686, in read
self.raise_exception()
File "C:\Users\gibin\AppData\Local\Programs\Python\Python37-32\lib\site-packages\wand\resource.py", line 240, in raise_exception
raise e
wand.exceptions.DelegateError: FailedToExecuteCommand `"gswin32c.exe" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r300x300" "-sOutputFile=C:/Users/GIBIN_~1./AppData/Local/Temp/magick-23476_sCYGtEq3gb-%d" "-fC:/Users/GIBIN_~1./AppData/Local/Temp/magick-234763X1vpsurlvH5" "-fC:/Users/GIBIN_~1./AppData/Local/Temp/magick-23476fUlS8Tr85dwk"' (The system cannot find the file specified.
) @ error/delegate.c/ExternalDelegateCommand/459
Code:
import io
from PIL import Image
import pytesseract
from wand.image import Image as wi
import gc

# Point pytesseract at the local Tesseract executable.
pytesseract.pytesseract.tesseract_cmd = r"C:\Users\gibin\AppData\Local\Tesseract-OCR\tesseract.exe"


def Get_text_from_image(pdf_path):
    print(pdf_path)
    # Rasterize the PDF at 300 DPI; this is the call that fails in the traceback.
    pdf = wi(filename=pdf_path, resolution=300)
    pdfImg = pdf.convert('jpeg')
    imgBlobs = []
    extracted_text = []
    # Render each page to an in-memory JPEG blob.
    for img in pdfImg.sequence:
        page = wi(image=img)
        imgBlobs.append(page.make_blob('jpeg'))
    print(len(imgBlobs))
    # Run Tesseract OCR on each page image.
    for imgBlob in imgBlobs:
        im = Image.open(io.BytesIO(imgBlob))
        text = pytesseract.image_to_string(im)
        print(text)
        extracted_text.append(text)
    # Return one string per page with line breaks removed.
    return [i.replace("\n", "") for i in extracted_text]


Get_text_from_image(r"C:\Users\gibin\PycharmProjects\ML\Image_PDF6676972_image.pdf")
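This worked fine after installing Ghostscript and adding it to the environment variables.
Download and install Ghostscript first.
After that, you need to set an environment variable.
Add a new system variable:
Variable: GS_PROG
Value: the full path to the gswin64c.exe file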
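Before calling Wand again, it can help to confirm that a Ghostscript executable is actually reachable. The following is only a minimal sketch: the helper name find_ghostscript and the list of candidate names are my own assumptions, GS_PROG is the variable named above, and gswin32c.exe is the name that appears in the traceback.

```python
import os
import shutil


def find_ghostscript(candidates=("gswin64c.exe", "gswin32c.exe", "gs"), env=None):
    """Return the path of a Ghostscript executable, or None if none is found.

    Hypothetical helper: checks the GS_PROG variable first, then searches
    PATH for each candidate executable name.
    """
    env = os.environ if env is None else env

    # GS_PROG should hold the full path to the executable itself.
    gs_prog = env.get("GS_PROG")
    if gs_prog and os.path.isfile(gs_prog):
        return gs_prog

    # Otherwise look for any of the usual executable names on PATH.
    for name in candidates:
        found = shutil.which(name, path=env.get("PATH"))
        if found:
            return found
    return None


if __name__ == "__main__":
    path = find_ghostscript()
    print(path or "Ghostscript not found; check the install and the environment variables")
```

If this prints a path but Wand still raises DelegateError, restarting the IDE or terminal is worth trying, since environment variable changes on Windows are only picked up by newly started processes.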
Have you seen this?
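Alternatively, you can convert the PDF pages to JPG images by another method. I have used the pdf2img library for this; if you are free to choose any library, I would recommend pdf2img.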
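A rough sketch of that alternative follows. It assumes that "pdf2img" refers to the pdf2image package (which needs Poppler installed) and that Tesseract is already configured for pytesseract as in the question. The function name ocr_scanned_pdf and its convert/ocr parameters are my own, added so the logic can be tried without either library installed.

```python
def ocr_scanned_pdf(pdf_path, dpi=300, convert=None, ocr=None):
    """OCR every page of a scanned PDF and return one string per page.

    convert and ocr are hypothetical injection points; by default they fall
    back to pdf2image.convert_from_path and pytesseract.image_to_string.
    """
    if convert is None:
        # Assumption: the pdf2image package, which relies on Poppler.
        from pdf2image import convert_from_path as convert
    if ocr is None:
        from pytesseract import image_to_string as ocr

    # convert_from_path returns a list of PIL images, one per page.
    pages = convert(pdf_path, dpi=dpi)

    # Replace line breaks with spaces so words on adjacent lines stay separated.
    return [ocr(page).replace("\n", " ").strip() for page in pages]


# Example call (needs pdf2image, Poppler, pytesseract and Tesseract):
# texts = ocr_scanned_pdf(r"C:\path\to\scanned.pdf")
```

This route does not go through ImageMagick, so the Ghostscript delegate error above should not come up, although Poppler has to be installed and reachable instead.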