pytesseract not recognizing symbols in front of letters

Question

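I'm trying to read a few blocks of text with pytesseract, but it fails to recognize symbols that appear in front of words or between words. When the symbols appear in front of numbers, however, it recognizes them.

Example:

An image of '#test $test %test' incorrectly prints 'Htest Stest Stest'

An image of '#500 0 %500' correctly prints '#500 0 %500'

Here is my code: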

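    import cv2
    import pytesseract
    from PIL import Image

    # Load the image and convert it to grayscale
    image = cv2.imread("test.png")
    image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Binarize: pixels brighter than the threshold become white, the rest black
    threshold = 225
    _, img_binarized = cv2.threshold(image, threshold, 255, cv2.THRESH_BINARY)
    pil_img = Image.fromarray(img_binarized)

    # Point pytesseract at the Tesseract executable
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'

    # Run OCR and print the recognized text
    msg = pytesseract.image_to_string(pil_img)
    print(msg)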

I've tried a range of different config settings in the image_to_string call but haven't found anything that works. Any help would be appreciated.

Answer 1

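I ended up downloading all of the .traineddata files from https://tesseract-ocr.github.io/tessdoc/Data-Files.html into my Tesseract-OCR folder and looping through all of them with the lang parameter of image_to_string. For some reason, a select few languages that share the same alphabet as English worked well (Italian and Croatian worked best).

My code is the same as above, with only the language changed: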

    msg = pytesseract.image_to_string(pil_img, lang='ita')
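For anyone who wants to repeat the language sweep, here is a minimal sketch of the loop described above. It is not the exact code I ran. The helper names ocr_with_each_language and languages_matching are made up for illustration, and so is the ocr parameter, which exists only so the loop can be exercised without Tesseract installed.

```python
def ocr_with_each_language(image, languages, ocr=None):
    """Run OCR once per language code and return {lang: recognized text}.

    `ocr` is a hypothetical hook for testing; when omitted, the helper
    falls back to pytesseract.image_to_string.
    """
    if ocr is None:
        # Imported lazily so the helper can be exercised without Tesseract
        import pytesseract
        ocr = pytesseract.image_to_string

    results = {}
    for lang in languages:
        try:
            results[lang] = ocr(image, lang=lang).strip()
        except Exception:
            # Assumption: any failure (e.g. traineddata that cannot be
            # loaded) is recorded as None instead of aborting the sweep
            results[lang] = None
    return results


def languages_matching(results, expected):
    """Return the language codes whose output equals the expected text."""
    return sorted(lang for lang, text in results.items() if text == expected)
```

With pil_img from the question, this could be called as ocr_with_each_language(pil_img, pytesseract.get_languages(config='')) and the result passed to languages_matching(results, '#test $test %test'). get_languages is available in recent pytesseract releases and lists the languages Tesseract finds installed. On an older release, a hand-written list of codes such as ['eng', 'ita', 'hrv'] should work the same way.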

Tags: python, tesseract, python-tesseract