PyTesseract Error for Multi Page Tiff Image

Question

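When I read in a 15-page multi-page TIFF image (a document of black letters/words on a white background), PyTesseract raises an error as I loop over the pages and convert each one to a string.

I am using the pytesseract package together with pyocr.builders. A single page seems to work fine, but I believe the program fails at the step where it converts the image to RGB because the image is not already RGB.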

img = Image.open(r'\users\ai\text.tiff')
img.load()
txt = ""
for frame in range(0, img.n_frames):
    img.seek(frame)
    txt += tool.image_to_string(img,builder=pyocr.builders.TextBuilder())

The expected output is the text of all 15 pages, displayed in the Jupyter window.

Error message

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-17-e59bdf3b773c> in <module>
      2 for frame in range(0, img.n_frames):
      3     img.seek(frame)
----> 4     txt += tool.image_to_string(img,builder=pyocr.builders.TextBuilder())
      5 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyocr\tesseract.py in image_to_string(image, lang, builder)
    357     with tempfile.TemporaryDirectory() as tmpdir:
    358         if image.mode != "RGB":
--> 359             image = image.convert("RGB")
    360         image.save(os.path.join(tmpdir, "input.bmp"))
    361         (status, errors) = run_tesseract("input.bmp", "output", cwd=tmpdir,

~\AppData\Local\Continuum\anaconda3\lib\site-packages\PIL\Image.py in convert(self, mode, matrix, dither, palette, colors)
    932         """
    933 
--> 934         self.load()
    935 
    936         if not mode and self.mode == "P":

~\AppData\Local\Continuum\anaconda3\lib\site-packages\PIL\TiffImagePlugin.py in load(self)
   1097     def load(self):
   1098         if self.use_load_libtiff:
-> 1099             return self._load_libtiff()
   1100         return super(TiffImageFile, self).load()
   1101 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\PIL\TiffImagePlugin.py in _load_libtiff(self)
   1189 
   1190         if err < 0:
-> 1191             raise IOError(err)
   1192 
   1193         return Image.Image.load(self)

OSError: -9

Answer 1

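For a question like this, you should provide a Minimal Reproducible Example, because some of your code is missing. You should also provide your test image. A multi-page TIFF cannot be attached here, so a link to one would be fine.

I found a test image in another question. It is a 10-page TIFF.

Here is a solution using pyocr: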

from PIL import Image

import pyocr
import pyocr.builders

# Use the first OCR tool that pyocr finds on this machine.
tools = pyocr.get_available_tools()
tool = tools[0]

image = Image.open('multipage_tiff_example.tif')

# OCR each page in turn and print its text.
for frame in range(image.n_frames):
    image.seek(frame)
    txt = tool.image_to_string(image, builder=pyocr.builders.TextBuilder())
    print(txt)

Here is a solution using pytesseract:

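from PIL import Image
import pytesseract

# Uncomment and adjust if tesseract.exe is not on your PATH:
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe'

image = Image.open('multipage_tiff_example.tif')

# Page segmentation mode 6: assume a single uniform block of text.
config = '--psm 6'

# OCR each page in turn and append its text, followed by a newline.
txt = ''
for frame in range(image.n_frames):
    image.seek(frame)
    txt += pytesseract.image_to_string(image, config=config, lang='eng') + '\n'

print(txt)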

Both give this output:

Multipage
TIFF
Example
Page 1
Multipage
TIFF
Example
Page 2
Multipage
TIFF
Example
Page 3
Multipage
TIFF
Example
Page 4
Multipage
TIFF
Example
Page5
Multipage
TIFF
Example
Page 6
Multipage
TIFF
Example
Page /
Multipage
TIFF
Example
Page 8
Multipage
TIFF
Example
Page 9
Multipage
TIFF

Example

Page 10
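
If the OSError from the question still shows up with your own file, it may help to find out which page triggers it. Here is a minimal sketch, not a fix: ocr_pages and its ocr_page argument are hypothetical names I made up for illustration. It assumes the image object has n_frames and seek() (as a PIL multi-page TIFF does) and that a failing page raises OSError, as in the traceback above. I have not checked that Pillow can always seek past a frame that failed to decode, so treat that as an assumption too.

```python
def ocr_pages(image, ocr_page):
    """Run ocr_page on every frame of a multi-page image.

    Returns (texts, errors): texts has one string per page (empty for a
    page that failed), errors maps the page index to the OSError raised.
    Both names are illustrative, not part of pytesseract or pyocr.
    """
    texts = []
    errors = {}
    for frame in range(image.n_frames):
        try:
            image.seek(frame)
            texts.append(ocr_page(image))
        except OSError as exc:
            # Record the failure and keep going so one bad page
            # does not abort the whole document.
            errors[frame] = exc
            texts.append("")
    return texts, errors
```

With pytesseract this could be called as texts, errors = ocr_pages(Image.open('multipage_tiff_example.tif'), pytesseract.image_to_string). Any page index that ends up in errors is the one worth inspecting (mode, compression) or re-saving.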
