How to restrict the recognized characters in tesserocr?

Question

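When using tesserocr, how can I restrict the set of characters that Tesseract recognizes to digits only?

I learned from this that, if I were using C++, I could set tessedit_char_whitelist in a config file, but I don't know the equivalent in tesserocr for Python.

In general, the tesserocr documentation is helpful if the reader already knows Tesseract's C++ API. Since I'm not proficient in C++, I'd like to avoid having to read C++ source code just to use tesserocr.

It would be great if someone could tell me what I actually need to write in Python, or give a general rule for going from config settings to Python code. Thanks in advance.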

Answer 1

tesserocr works like the C++ API: you can set the whitelist with the SetVariable function.

An example:

from tesserocr import PyTessBaseAPI
from string import digits

with PyTessBaseAPI() as api:
    api.SetVariable('tessedit_char_whitelist', digits)
    api.SetImageFile('image.png')
    print(api.GetUTF8Text())  # prints only digits
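
As for a general rule from config settings to Python code: a Tesseract config file is a list of "name value" lines, and each line corresponds to one SetVariable(name, value) call. The sketch below illustrates that mapping. The helper names parse_tesseract_config and apply_variables are hypothetical (they are not part of tesserocr), and it assumes that SetVariable returns a false value when a variable name is not recognized.

```python
def parse_tesseract_config(text):
    """Parse Tesseract config-file text ("name value" per line) into a dict."""
    variables = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comment lines
        parts = line.split(None, 1)  # split on the first run of whitespace
        variables[parts[0]] = parts[1] if len(parts) > 1 else ''
    return variables


def apply_variables(api, variables):
    """Call api.SetVariable for each entry; return the names that were rejected.

    Assumption: api.SetVariable returns a false value for unknown names.
    """
    rejected = []
    for name, value in variables.items():
        if not api.SetVariable(name, value):
            rejected.append(name)
    return rejected


# Possible usage with tesserocr (not executed here):
#
#     from tesserocr import PyTessBaseAPI
#     with PyTessBaseAPI() as api:
#         apply_variables(api, parse_tesseract_config(
#             'tessedit_char_whitelist 0123456789'))
#         api.SetImageFile('image.png')
#         print(api.GetUTF8Text())
```

Set the variables before calling GetUTF8Text, as in the example above.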

If you want another approach that is more direct and independent of the C++ API, try the pytesseract module.

pytesseract example:

import pytesseract
from PIL import Image
from string import digits

image = Image.open('image.png')
print(pytesseract.image_to_string(
    image, config='-c tessedit_char_whitelist=' + digits))
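
With pytesseract the same settings travel as "-c name=value" options inside the config string. A minimal sketch of building that string from a dict follows. build_config is a hypothetical helper, not a pytesseract function, and it assumes the values contain no whitespace (values with spaces would need quoting).

```python
def build_config(variables):
    """Join variables into a config string of '-c name=value' options."""
    return ' '.join(
        '-c {}={}'.format(name, value)
        for name, value in sorted(variables.items()))


config = build_config({'tessedit_char_whitelist': '0123456789'})
# config can then be passed as pytesseract.image_to_string(image, config=config)
```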

