无法从图像中提取单词
Unable to extract a word out of an image
我在 python
中结合 pytesseract
编写了一个脚本,用于从图像中提取单词。该图像中只有一个词 TOOLS
可用,这就是我所追求的。目前,我的以下脚本给出了错误的输出,即 WIS
。我该怎么做才能获得文本?
这是我的脚本:
import requests, io, pytesseract
from PIL import Image
response = requests.get('http://facweb.cs.depaul.edu/sgrais/images/Type/Tools.jpg')
img = Image.open(io.BytesIO(response.content))
img = img.resize([100,100], Image.ANTIALIAS)
img = img.convert('L')
img = img.point(lambda x: 0 if x < 170 else 255)
imagetext = pytesseract.image_to_string(img)
print(imagetext)
# img.show()
这是我运行上面脚本修改后图片的状态:
我的输出:
WIS
预期输出:
TOOLS
您实施的关键问题在于:
img = img.resize([100,100], Image.ANTIALIAS)
img = img.point(lambda x: 0 if x < 170 else 255)
您可以尝试不同的大小和不同的阈值:
import requests, io, pytesseract
from PIL import Image
from PIL import ImageFilter
response = requests.get('http://facweb.cs.depaul.edu/sgrais/images/Type/Tools.jpg')
img = Image.open(io.BytesIO(response.content))
filters = [
# ('nearest', Image.NEAREST),
('box', Image.BOX),
# ('bilinear', Image.BILINEAR),
# ('hamming', Image.HAMMING),
# ('bicubic', Image.BICUBIC),
('lanczos', Image.LANCZOS),
]
subtle_filters = [
# 'BLUR',
# 'CONTOUR',
'DETAIL',
'EDGE_ENHANCE',
'EDGE_ENHANCE_MORE',
# 'EMBOSS',
'FIND_EDGES',
'SHARPEN',
'SMOOTH',
'SMOOTH_MORE',
]
for name, filt in filters:
for subtle_filter_name in subtle_filters:
for s in range(220, 250, 10):
for threshold in range(250, 253, 1):
img_temp = img.copy()
img_temp.thumbnail([s,s], filt)
img_temp = img_temp.convert('L')
img_temp = img_temp.point(lambda x: 0 if x < threshold else 255)
img_temp = img_temp.filter(getattr(ImageFilter, subtle_filter_name))
imagetext = pytesseract.image_to_string(img_temp)
print(s, threshold, name, subtle_filter_name, imagetext)
with open('thumb%s_%s_%s_%s.jpg' % (s, threshold, name, subtle_filter_name), 'wb') as g:
img_temp.save(g)
看看什么适合你。
我建议您在保持原始比例的同时调整图片大小。您也可以尝试 img_temp.convert('L')
的替代方法
迄今为止最佳:TWls
和 T0018
您可以尝试手动操作图像,看看是否可以找到一些可以提供更好输出的编辑(例如 http://gimpchat.com/viewtopic.php?f=8&t=1193)
通过提前了解字体,您可能也会获得更好的结果。
关键是将图像变换与 tesseract
能力相匹配。您的主要问题是字体不是通常的字体。您只需要
from PIL import Image, ImageEnhance, ImageFilter
response = requests.get('http://facweb.cs.depaul.edu/sgrais/images/Type/Tools.jpg')
img = Image.open(io.BytesIO(response.content))
# remove texture
enhancer = ImageEnhance.Color(img)
img = enhancer.enhance(0) # decolorize
img = img.point(lambda x: 0 if x < 250 else 255) # set threshold
img = img.resize([300, 100], Image.LANCZOS) # resize to remove noise
img = img.point(lambda x: 0 if x < 250 else 255) # get rid of remains of noise
# adjust font weight
img = img.filter(ImageFilter.MaxFilter(11)) # lighten the font ;)
imagetext = pytesseract.image_to_string(img)
print(imagetext)
瞧,
TOOLS
被识别。
我在 python
中结合 pytesseract
编写了一个脚本,用于从图像中提取单词。该图像中只有一个词 TOOLS
可用,这就是我所追求的。目前,我的以下脚本给出了错误的输出,即 WIS
。我该怎么做才能获得文本?
这是我的脚本:
import requests, io, pytesseract
from PIL import Image
response = requests.get('http://facweb.cs.depaul.edu/sgrais/images/Type/Tools.jpg')
img = Image.open(io.BytesIO(response.content))
img = img.resize([100,100], Image.ANTIALIAS)
img = img.convert('L')
img = img.point(lambda x: 0 if x < 170 else 255)
imagetext = pytesseract.image_to_string(img)
print(imagetext)
# img.show()
这是我运行上面脚本修改后图片的状态:
我的输出:
WIS
预期输出:
TOOLS
您实施的关键问题在于:
img = img.resize([100,100], Image.ANTIALIAS)
img = img.point(lambda x: 0 if x < 170 else 255)
您可以尝试不同的大小和不同的阈值:
import requests, io, pytesseract
from PIL import Image
from PIL import ImageFilter
response = requests.get('http://facweb.cs.depaul.edu/sgrais/images/Type/Tools.jpg')
img = Image.open(io.BytesIO(response.content))
filters = [
# ('nearest', Image.NEAREST),
('box', Image.BOX),
# ('bilinear', Image.BILINEAR),
# ('hamming', Image.HAMMING),
# ('bicubic', Image.BICUBIC),
('lanczos', Image.LANCZOS),
]
subtle_filters = [
# 'BLUR',
# 'CONTOUR',
'DETAIL',
'EDGE_ENHANCE',
'EDGE_ENHANCE_MORE',
# 'EMBOSS',
'FIND_EDGES',
'SHARPEN',
'SMOOTH',
'SMOOTH_MORE',
]
for name, filt in filters:
for subtle_filter_name in subtle_filters:
for s in range(220, 250, 10):
for threshold in range(250, 253, 1):
img_temp = img.copy()
img_temp.thumbnail([s,s], filt)
img_temp = img_temp.convert('L')
img_temp = img_temp.point(lambda x: 0 if x < threshold else 255)
img_temp = img_temp.filter(getattr(ImageFilter, subtle_filter_name))
imagetext = pytesseract.image_to_string(img_temp)
print(s, threshold, name, subtle_filter_name, imagetext)
with open('thumb%s_%s_%s_%s.jpg' % (s, threshold, name, subtle_filter_name), 'wb') as g:
img_temp.save(g)
看看什么适合你。
我建议您在保持原始比例的同时调整图片大小。您也可以尝试 img_temp.convert('L')
迄今为止最佳:TWls
和 T0018
您可以尝试手动操作图像,看看是否可以找到一些可以提供更好输出的编辑(例如 http://gimpchat.com/viewtopic.php?f=8&t=1193)
通过提前了解字体,您可能也会获得更好的结果。
关键是将图像变换与 tesseract
能力相匹配。您的主要问题是字体不是通常的字体。您只需要
from PIL import Image, ImageEnhance, ImageFilter
response = requests.get('http://facweb.cs.depaul.edu/sgrais/images/Type/Tools.jpg')
img = Image.open(io.BytesIO(response.content))
# remove texture
enhancer = ImageEnhance.Color(img)
img = enhancer.enhance(0) # decolorize
img = img.point(lambda x: 0 if x < 250 else 255) # set threshold
img = img.resize([300, 100], Image.LANCZOS) # resize to remove noise
img = img.point(lambda x: 0 if x < 250 else 255) # get rid of remains of noise
# adjust font weight
img = img.filter(ImageFilter.MaxFilter(11)) # lighten the font ;)
imagetext = pytesseract.image_to_string(img)
print(imagetext)
瞧,
TOOLS
被识别。