Unable to extract a word out of an image

Question

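I've written a script in Python, in combination with pytesseract, to extract a word out of an image. The image contains only a single word, TOOLS, and that is what I'm after. Currently, my script below gives me the wrong output, WIS. What can I do to get the text?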

This is my script:

import requests, io, pytesseract
from PIL import Image

response = requests.get('http://facweb.cs.depaul.edu/sgrais/images/Type/Tools.jpg')
img = Image.open(io.BytesIO(response.content))
img = img.resize([100,100], Image.ANTIALIAS)
img = img.convert('L')
img = img.point(lambda x: 0 if x < 170 else 255)
imagetext = pytesseract.image_to_string(img)
print(imagetext)
# img.show()

This is the state of the image after my script above has modified it:

Output I'm getting:

WIS

Expected output:

TOOLS

Answer 1

The key problem with your implementation lies in these two lines:

img = img.resize([100,100], Image.ANTIALIAS)
img = img.point(lambda x: 0 if x < 170 else 255)

You can try different sizes and different thresholds:

import requests, io, pytesseract
from PIL import Image
from PIL import ImageFilter

response = requests.get('http://facweb.cs.depaul.edu/sgrais/images/Type/Tools.jpg')
img = Image.open(io.BytesIO(response.content))
filters = [
    # ('nearest', Image.NEAREST),
    ('box', Image.BOX),
    # ('bilinear', Image.BILINEAR),
    # ('hamming', Image.HAMMING),
    # ('bicubic', Image.BICUBIC),
    ('lanczos', Image.LANCZOS),
]

subtle_filters = [
    # 'BLUR',
    # 'CONTOUR',
    'DETAIL',
    'EDGE_ENHANCE',
    'EDGE_ENHANCE_MORE',
    # 'EMBOSS',
    'FIND_EDGES',
    'SHARPEN',
    'SMOOTH',
    'SMOOTH_MORE',
]

for name, filt in filters:
    for subtle_filter_name in subtle_filters:
        for s in range(220, 250, 10):
            for threshold in range(250, 253, 1):
                img_temp = img.copy()
                img_temp.thumbnail([s,s], filt)
                img_temp = img_temp.convert('L')
                img_temp = img_temp.point(lambda x: 0 if x < threshold else 255)
                img_temp = img_temp.filter(getattr(ImageFilter, subtle_filter_name))
                imagetext = pytesseract.image_to_string(img_temp)
                print(s, threshold, name, subtle_filter_name, imagetext)
                with open('thumb%s_%s_%s_%s.jpg' % (s, threshold, name, subtle_filter_name), 'wb') as g:
                    img_temp.save(g)

See what works for you.
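The threshold step in the loop above can also be written as a 256-entry lookup table, which Pillow's point() accepts in place of a function. The sketch below is pure Python so it runs without an image; the helper name threshold_table is hypothetical and not part of Pillow.

```python
def threshold_table(threshold):
    """Return a 256-entry lookup table: 0 below the threshold, 255 otherwise."""
    return [0 if value < threshold else 255 for value in range(256)]


table = threshold_table(250)

# With Pillow installed, a grayscale image could then be binarized with:
#     img_temp = img_temp.convert('L').point(table)
print(table[249], table[250])  # prints: 0 255
```

Building the table once per threshold makes it easy to print and inspect exactly which gray levels end up black.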

I suggest resizing the image while keeping its original aspect ratio. You could also try alternatives to img_temp.convert('L').
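As a minimal sketch of the aspect-ratio advice: thumbnail(), used in the loop above, already keeps the ratio but only shrinks an image. If you want to call resize() instead, you can compute the target size yourself. The helper name fit_within and the 600x200 source size are hypothetical and chosen only for illustration.

```python
def fit_within(size, max_side):
    """Scale (width, height) so the longer side equals max_side, keeping the ratio."""
    width, height = size
    scale = max_side / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


# Hypothetical source size; with Pillow you would pass img.size instead:
#     img = img.resize(fit_within(img.size, 300), Image.LANCZOS)
print(fit_within((600, 200), 300))  # prints: (300, 100)
```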

Best results so far: TWls and T0018

You can also try manipulating the image by hand to see whether you can find edits that give better output (for example http://gimpchat.com/viewtopic.php?f=8&t=1193).

Knowing the font in advance would probably give you better results as well.

Answer 2

The key is to match the image transformations to what tesseract can handle. Your main problem is that the font is an unusual one. All you need is:

import io

import pytesseract
import requests
from PIL import Image, ImageEnhance, ImageFilter

response = requests.get('http://facweb.cs.depaul.edu/sgrais/images/Type/Tools.jpg')
img = Image.open(io.BytesIO(response.content))

# remove texture
enhancer = ImageEnhance.Color(img)
img = enhancer.enhance(0)   # decolorize
img = img.point(lambda x: 0 if x < 250 else 255) # set threshold
img = img.resize([300, 100], Image.LANCZOS) # resize to remove noise
img = img.point(lambda x: 0 if x < 250 else 255) # get rid of remains of noise
# adjust font weight
img = img.filter(ImageFilter.MaxFilter(11)) # thin the dark strokes, i.e. lighten the font ;)
imagetext = pytesseract.image_to_string(img)
print(imagetext)

Voilà,

TOOLS

is recognized.
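To see why MaxFilter lightens the font, here is a minimal pure-Python sketch of a max filter: each pixel takes the brightest value in its neighborhood, so white eats into the dark strokes from both sides. The function max_filter below is an illustration written for this answer, not Pillow's implementation, and its handling of image borders is an assumption that may differ from Pillow's.

```python
def max_filter(rows, size):
    """Replace each pixel with the maximum value in its size x size neighborhood."""
    radius = size // 2
    height, width = len(rows), len(rows[0])
    result = []
    for y in range(height):
        out_row = []
        for x in range(width):
            # Collect the neighborhood, clipped at the image borders.
            window = [
                rows[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            out_row.append(max(window))
        result.append(out_row)
    return result


# A dark stroke (0) five pixels wide on a white background (255), one row tall.
row = [255, 255, 0, 0, 0, 0, 0, 255, 255]
thinned = max_filter([row], 3)[0]
print(thinned)  # prints: [255, 255, 255, 0, 0, 0, 255, 255, 255]
```

With a window of 3 the stroke shrinks from five pixels to three. The window of 11 used above removes about five pixels from each side of a stroke, so it suits heavy fonts like this one and would erase thin ones.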

python

web-scraping

python-imaging-library

python-3.x

python-tesseract