Tesserocr does not recognize text

Question

I'd like to ask how to fix tesserocr failing to recognize one particular line in an image.

Here is the image. It comes from Simple Digit Recognition OCR in OpenCV-Python.

Code

from PIL import Image
from tesserocr import PyTessBaseAPI, RIL

image = Image.open('test3.png')
with PyTessBaseAPI() as api:
    api.SetImage(image)
    # Find the bounding box of every text line in the image
    boxes = api.GetComponentImages(RIL.TEXTLINE, True)
    print('Found {} textline image components.'.format(len(boxes)))
    for i, (im, box, _, _) in enumerate(boxes):
        # Restrict recognition to the current text line
        api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
        ocrResult = api.GetUTF8Text()
        conf = api.MeanTextConf()
        result = (u"Box[{0}]: x={x}, y={y}, w={w}, h={h}, "
                  "confidence: {1}, text: {2}").format(i, conf, ocrResult, **box)
        # Print inside the loop so every box is reported, not only the last one
        print(result)

The result looks like this:

Found 5 textline image components.
Box[0]: x=10, y=5, w=582, h=29, confidence: 81, text: 9821480865132823066470938


Box[1]: x=9, y=55, w=581, h=30, confidence: 91, text: 4460955058223172535940812


Box[2]: x=10, y=106, w=575, h=30, confidence: 90, text: 8481117450284102701938521


Box[3]: x=12, y=157, w=580, h=30, confidence: 0, text:
Box[4]: x=11, y=208, w=581, h=30, confidence: 89, text: 6442881097566593344612847

It fails to recognize the digits in box 3. What should I add or change in the script so that box 3 shows the correct result?

Thanks for your help.

Answer 1

With the default psm 3 and oem 3 modes, Tesseract 4.00.00alpha recognizes it correctly. Here is the result.

If you are still on v3.x, I suggest upgrading Tesseract to v4.0 and using it with tesserocr.

EDIT:

To build tesserocr with support for v4.00.00alpha, see the tesserocr issue titled "Is any plan to porting tesseract 4.0 (alpha)". It contains guidelines for making it work.
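As a rough sketch of the advice in this answer, the snippet below checks which Tesseract version tesserocr is linked against and sets psm 3 and oem 3 explicitly. The helper names `tesseract_major_version` and `ocr_with_defaults` are made up for illustration, and the `oem` keyword is assumed to be accepted by the installed tesserocr release (older releases may not have it).

```python
import re


def tesseract_major_version(version_text):
    """Return the major version from text such as 'tesseract 3.05.00'."""
    match = re.search(r"tesseract\s+(\d+)", version_text)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version_text)
    return int(match.group(1))


def ocr_with_defaults(image_path):
    """OCR a whole image with psm 3 (PSM.AUTO) and oem 3 (OEM.DEFAULT)."""
    # Imported lazily so the version helper works without tesserocr installed.
    import tesserocr
    from tesserocr import PyTessBaseAPI, PSM, OEM

    if tesseract_major_version(tesserocr.tesseract_version()) < 4:
        print("Linked against Tesseract 3.x; this answer reports success with 4.00.00alpha")
    # Assumption: this tesserocr build accepts the `oem` keyword argument.
    with PyTessBaseAPI(psm=PSM.AUTO, oem=OEM.DEFAULT) as api:
        api.SetImageFile(image_path)
        return api.GetUTF8Text()
```

Calling `ocr_with_defaults('test3.png')` would return the text of the whole image; whether the fourth line comes back still depends on the Tesseract version actually installed.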

Answer 2

The code below produces the correct OCR result, but without the x, y, w, h and confidence information.

import tesserocr
from PIL import Image

print(tesserocr.tesseract_version())  # print the tesseract-ocr version

image = Image.open('SO_5TextLines.png')

text = tesserocr.image_to_text(image)  # run OCR on the whole image
for line in text.splitlines():  # handles any newline style
    print(line)

Output:

tesseract 3.05.00
 leptonica-1.74.1
  libjpeg 8d : libpng 1.6.27 : libtiff 4.0.6 : zlib 1.2.8 : libopenjp2 2.1.2

9821480865132823066470938
4460955058223172535940812
8481117450284102701938521
1055596446229489549303819
6442881097566593344612847
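The output above has the text but no boxes or confidence values. One possible way to get all three from a single recognition pass, without calling SetRectangle() at all, is tesserocr's result iterator. This is only a sketch: `to_xywh` and `recognize_lines` are hypothetical helper names, and it has not been checked against the image from the question.

```python
def to_xywh(bbox):
    """Convert a (left, top, right, bottom) tuple into an x/y/w/h dict."""
    left, top, right, bottom = bbox
    return {"x": left, "y": top, "w": right - left, "h": bottom - top}


def recognize_lines(image_path):
    """Return a list of (box, confidence, text), one entry per text line."""
    # Imported lazily so to_xywh can be used without tesserocr installed.
    from tesserocr import PyTessBaseAPI, RIL, iterate_level

    results = []
    with PyTessBaseAPI() as api:
        api.SetImageFile(image_path)
        api.Recognize()  # recognize the whole page once
        level = RIL.TEXTLINE
        for item in iterate_level(api.GetIterator(), level):
            bbox = item.BoundingBox(level)
            if bbox is None:  # nothing to report at this position
                continue
            text = item.GetUTF8Text(level).strip()
            results.append((to_xywh(bbox), item.Confidence(level), text))
    return results
```

Each entry can then be formatted like the question's output, for example `"x={x}, y={y}, w={w}, h={h}".format(**box)`.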

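I ran your code on OSX Sierra and got the same result, with line 4 missing. The problem appears to be in api.SetRectangle(); you can modify your code to print boxes to investigate further. My sample code is based only on the sample text image you provided, so it needs to be tested with more images to verify that it works in general.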
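One way to follow up on the SetRectangle() suspicion is to print each raw box and retry recognition with a few pixels of margin around it. The margin is my own guess at something worth trying, not a fix confirmed in this thread; `pad_box`, `ocr_padded_lines` and the default of 5 pixels are illustrative choices.

```python
def pad_box(box, pad, image_width, image_height):
    """Grow an x/y/w/h box by `pad` pixels per side, clamped to the image."""
    left = max(box["x"] - pad, 0)
    top = max(box["y"] - pad, 0)
    right = min(box["x"] + box["w"] + pad, image_width)
    bottom = min(box["y"] + box["h"] + pad, image_height)
    return {"x": left, "y": top, "w": right - left, "h": bottom - top}


def ocr_padded_lines(image_path, pad=5):
    """Recognize each text line inside a slightly enlarged rectangle."""
    # Imported lazily so pad_box can be used without PIL or tesserocr.
    from PIL import Image
    from tesserocr import PyTessBaseAPI, RIL

    image = Image.open(image_path)
    results = []
    with PyTessBaseAPI() as api:
        api.SetImage(image)
        for _, box, _, _ in api.GetComponentImages(RIL.TEXTLINE, True):
            print(box)  # inspect the raw rectangle first
            padded = pad_box(box, pad, image.width, image.height)
            api.SetRectangle(padded["x"], padded["y"], padded["w"], padded["h"])
            text = api.GetUTF8Text().strip()
            results.append((padded, api.MeanTextConf(), text))
    return results
```

If box 3 is still empty with a margin, the rectangle is probably not the cause and the Tesseract version becomes the more likely suspect.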

Hope this helps.


python

tesseract