PyTesseract Error for Multi Page Tiff Image
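PyTesseract raises an error when I read in a 15-page multi-page TIFF image (a document of black letters/words on a white background), loop over the pages, and convert each page to a string.
I am using the pytesseract package together with pyocr.builders. A single page seems to work fine, but I believe the error occurs when the program converts the image to RGB because it is not already RGB.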
img = Image.open(r'\users\ai\text.tiff')
img.load()
txt = ""
for frame in range(0, img.n_frames):
    img.seek(frame)
    txt += tool.image_to_string(img, builder=pyocr.builders.TextBuilder())
The expected output is the text of all 15 pages displayed in the Jupyter window.
Error message:
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-17-e59bdf3b773c> in <module>
2 for frame in range(0, img.n_frames):
3 img.seek(frame)
----> 4 txt += tool.image_to_string(img,builder=pyocr.builders.TextBuilder())
5
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyocr\tesseract.py in image_to_string(image, lang, builder)
357 with tempfile.TemporaryDirectory() as tmpdir:
358 if image.mode != "RGB":
--> 359 image = image.convert("RGB")
360 image.save(os.path.join(tmpdir, "input.bmp"))
361 (status, errors) = run_tesseract("input.bmp", "output", cwd=tmpdir,
~\AppData\Local\Continuum\anaconda3\lib\site-packages\PIL\Image.py in convert(self, mode, matrix, dither, palette, colors)
932 """
933
--> 934 self.load()
935
936 if not mode and self.mode == "P":
~\AppData\Local\Continuum\anaconda3\lib\site-packages\PIL\TiffImagePlugin.py in load(self)
1097 def load(self):
1098 if self.use_load_libtiff:
-> 1099 return self._load_libtiff()
1100 return super(TiffImageFile, self).load()
1101
~\AppData\Local\Continuum\anaconda3\lib\site-packages\PIL\TiffImagePlugin.py in _load_libtiff(self)
1189
1190 if err < 0:
-> 1191 raise IOError(err)
1192
1193 return Image.Image.load(self)
OSError: -9
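For a question like this, you should provide a Minimal Reproducible Example, because some of the code has been left out. You should also provide your test image. You can't attach a multi-page TIFF to a question, though, so a link to one would be fine.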
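When putting such an example together, it can also help to report what each page of the TIFF looks like, since the question suspects the conversion of non-RGB frames. The following is only a sketch: describe_frames is a hypothetical helper name, and it assumes the image object exposes n_frames, seek(), mode, size and info the way Pillow's TIFF images do (Pillow stores the TIFF compression name under info["compression"]).

```python
def describe_frames(image):
    """Return one dict per frame: index, mode, size and compression.

    `image` is assumed to behave like a Pillow multi-page image
    (attributes n_frames, mode, size, info and a seek() method).
    """
    report = []
    for index in range(image.n_frames):
        image.seek(index)  # move to the requested page
        report.append({
            "frame": index,
            "mode": image.mode,
            "size": image.size,
            # Missing for formats that do not record a compression name
            "compression": image.info.get("compression"),
        })
    return report


# Possible usage with Pillow (file name taken from the answer's example):
#   from PIL import Image
#   with Image.open('multipage_tiff_example.tif') as img:
#       for row in describe_frames(img):
#           print(row)
```

Posting this report alongside the traceback shows whether the failing file uses a mode or compression that differs from files that work.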
I found this test image from this question. It is a 10-page TIFF.
Here is a solution using pyocr:
from PIL import Image
import pyocr
import pyocr.builders

# Use the first OCR tool that pyocr finds on this machine
tools = pyocr.get_available_tools()
tool = tools[0]

image = Image.open('multipage_tiff_example.tif')
txt = ''
for frame in range(image.n_frames):
    image.seek(frame)  # move to the next page of the TIFF
    # Append each page; plain assignment would keep only the last page
    txt += tool.image_to_string(image, builder=pyocr.builders.TextBuilder()) + '\n'
print(txt)
Here is a solution using pytesseract:
from PIL import Image
import pytesseract
# Uncomment and adjust if tesseract is not on your PATH (raw string keeps the backslashes intact)
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe'
image = Image.open('multipage_tiff_example.tif')
config = '--psm 6'
txt = ''
for frame in range(image.n_frames):
    image.seek(frame)  # move to the next page of the TIFF
    txt += pytesseract.image_to_string(image, config=config, lang='eng') + '\n'
print(txt)
Both give this output:
Multipage
TIFF
Example
Page 1
Multipage
TIFF
Example
Page 2
Multipage
TIFF
Example
Page 3
Multipage
TIFF
Example
Page 4
Multipage
TIFF
Example
Page5
Multipage
TIFF
Example
Page 6
Multipage
TIFF
Example
Page /
Multipage
TIFF
Example
Page 8
Multipage
TIFF
Example
Page 9
Multipage
TIFF
Example
Page 10
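If one page of a TIFF fails to decode, as with the OSError: -9 in the question, the loops above stop at that page and the text of every later page is lost. A minimal sketch of a loop that records the failure and carries on is shown here. ocr_pages is a hypothetical helper, not part of pytesseract or pyocr; it assumes the image has n_frames and seek() like a Pillow TIFF, and it takes the OCR function as an argument so that either library can be plugged in.

```python
def ocr_pages(image, ocr):
    """OCR every frame of a multi-page image without stopping on errors.

    `image` is assumed to have n_frames and seek() (as Pillow TIFFs do).
    `ocr` is any callable that takes the image and returns a string.
    Returns (texts, failures): texts holds one entry per page (None where
    the page failed), failures maps page index -> the exception raised.
    """
    texts = []
    failures = {}
    for index in range(image.n_frames):
        try:
            image.seek(index)
            texts.append(ocr(image))
        except (OSError, EOFError) as exc:
            # Decode errors from Pillow/libtiff surface as OSError
            texts.append(None)
            failures[index] = exc
    return texts, failures


# Possible usage with pytesseract (file name taken from the example above):
#   from PIL import Image
#   import pytesseract
#   image = Image.open('multipage_tiff_example.tif')
#   texts, failures = ocr_pages(image, pytesseract.image_to_string)
#   print('\n'.join(t for t in texts if t is not None))
#   print('failed pages:', sorted(failures))
```

This does not fix the underlying decode error, but the list of failed page indices narrows down which frames of the file are the problem.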