Tesseract not converting some images

Question

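I am writing Python code to convert some images to strings. I have some pictures of mobile phone numbers in PNG format, but only one of them is converted to text; the others are not.

Here is my code: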

try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract
import os

THIS_FOLDER = os.path.dirname(os.path.abspath(__file__))
my_file = os.path.join(THIS_FOLDER, 'images/3564.jpg')

def ocr_core(filename):
    """
    This function will handle the core OCR processing of images.
    """
    text = pytesseract.image_to_string(Image.open(filename))  # We'll use Pillow's Image class to open the image and pytesseract to detect the string in the image
    return text


for x in range(24):

    number = ocr_core('/Users/evilslab/Documents/Websites/www.futurepoint.dev.cc/dobuyme/python/images/'+str(x)+'.png')

    print ("The number is "+number)

I have 24 images, and I only get a value for the 9th one.

This image works:

This one does not work:

Why does this happen?

Answer 1

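Sometimes an image needs some preprocessing to improve its quality before Tesseract can read it.

See the Tesseract Wiki: Improving the quality of the output

For your example, I only had to resize the image to at least 120% of its original size to get the number.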

from PIL import Image
import pytesseract
import os

folder = os.path.dirname(os.path.abspath(__file__))

def ocr_core(filename):
    image = Image.open(filename)
    w, h = image.size

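    # Upscale the image; small text is often missed at its original size.
    #image = image.resize((int(w*1.2), int(h*1.2))) # 120%
    image = image.resize((w*2, h*2)) # 200%

    # Optionally restrict recognition to digits and "+":
    #text = pytesseract.image_to_string(image, config='-c tessedit_char_whitelist="0123456789+"')
    text = pytesseract.image_to_string(image)
    text = text.replace(' ', '')  # drop spaces between digit groups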

    return text

for filename in os.listdir(folder):
    if filename.endswith('.png'):
        number = ocr_core(os.path.join(folder, filename))
        print("number:", number)
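The scale factor that works may differ from image to image, so one option is to try a few factors in turn and stop at the first result that looks like a phone number. The following is only a minimal sketch of that idea: the helper names `scaled_size` and `first_plausible`, the list of factors, and the `PHONE_RE` pattern are all assumptions made for illustration, not part of pytesseract. The OCR call is passed in as a function so the retry logic stays independent of Tesseract.

```python
import re

# Assumed shape of a phone number: optional "+", then 6 to 15 digits.
PHONE_RE = re.compile(r"^\+?\d{6,15}$")


def scaled_size(size, factor):
    """Return the (width, height) tuple scaled by factor, as integers."""
    w, h = size
    return (max(1, round(w * factor)), max(1, round(h * factor)))


def first_plausible(ocr_at_scale, factors=(1.0, 1.2, 2.0, 3.0)):
    """Call ocr_at_scale(factor) for each factor until the text,
    with all whitespace removed, matches PHONE_RE.

    Returns (factor, text), or (None, "") if no factor produced a match.
    """
    for factor in factors:
        text = "".join(ocr_at_scale(factor).split())
        if PHONE_RE.match(text):
            return factor, text
    return None, ""


# Possible usage with Pillow and pytesseract (hypothetical file name):
# image = Image.open("0.png")
# factor, number = first_plausible(
#     lambda f: pytesseract.image_to_string(
#         image.resize(scaled_size(image.size, f))))
```

Removing all whitespace also drops the trailing newline that `image_to_string` typically returns, which `text.replace(' ', '')` alone would leave in place.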

EDIT: When I use the option --psm 7, which means "Treat the image as a single text line." (see Page segmentation method), it recognizes the number even without resizing.

text = pytesseract.image_to_string(image, config='--psm 7')
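The page segmentation mode can be combined with the character whitelist from the commented-out line in the code above. Below is a hedged sketch of how the two options might be put together; `build_config`, `clean_number`, and `read_number` are hypothetical helper names, and whether the whitelist is honored depends on the Tesseract version and engine in use, so the output is also filtered in Python as a fallback.

```python
import re


def build_config(psm=7, whitelist="0123456789+"):
    """Build a Tesseract config string.

    --psm selects the page segmentation mode (7 = single text line);
    tessedit_char_whitelist limits the characters Tesseract may output.
    """
    config = f"--psm {psm}"
    if whitelist:
        config += f" -c tessedit_char_whitelist={whitelist}"
    return config


def clean_number(raw):
    """Keep only digits and '+', dropping spaces, newlines and stray symbols."""
    return re.sub(r"[^0-9+]", "", raw)


def read_number(path):
    """Run OCR on one image file and return the cleaned number string."""
    # Imported here so the helpers above work without Tesseract installed.
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:
        raw = pytesseract.image_to_string(image, config=build_config())
    return clean_number(raw)
```

If single-line mode alone is not enough for a particular image, resizing as shown earlier can still be applied before the `image_to_string` call.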
