Extract text from light text on a white background image

Question

I have the following image:

I want to extract the text from it, which should be ws35. I have tried the pytesseract library with the following call:

pytesseract.image_to_string(Image.open(path))

But it returns nothing. Am I doing something wrong? How can I get the text back using OCR? Do I need to apply some filters to the image first?

Answer 1

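The problem is that this image is of poor quality and very noisy. Even professional and enterprise programs struggle with images like this.

You have most likely seen a CAPTCHA before. One reason they exist is that your answer is sent back to a database together with the image and is then used to train computers to read this kind of image.

The short answer is: pytesseract can't read the text inside this image as it is, and most likely no other module or professional program can read it either.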

Answer 2

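You can try the following approach:

1. Binarize the image with a method of your choice (in this case, a fixed threshold of 127 seems to be sufficient).
2. Use a minimum filter to connect the loose dots so that they form characters. A filter with r=4 seems to work quite well.
3. If necessary, the result can be improved by applying a median blur (r=4).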
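The steps above could be sketched with Pillow roughly as follows. This is only a minimal sketch, not the answerer's actual code: the function name `preprocess` is made up for illustration, and it assumes that "r=4" means a radius of 4 pixels, which corresponds to a 9x9 window in Pillow's `MinFilter` and `MedianFilter`.

```python
from PIL import Image, ImageFilter


def preprocess(img, threshold=127, radius=4, blur=True):
    """Binarize, connect dark dots with a minimum filter, optionally median-blur.

    `preprocess` is a hypothetical helper; the r -> window size mapping is an assumption.
    """
    gray = img.convert("L")
    # Binarize: pixels brighter than the threshold become white, the rest black.
    binary = gray.point(lambda p: 255 if p > threshold else 0)
    # Pillow rank filters take an odd window size, so radius r maps to 2*r + 1.
    size = 2 * radius + 1
    # The minimum filter keeps the darkest pixel in each window,
    # which grows the dark dots until neighbouring dots merge.
    connected = binary.filter(ImageFilter.MinFilter(size))
    if blur:
        # The median blur smooths the jagged outlines left by the minimum filter.
        connected = connected.filter(ImageFilter.MedianFilter(size))
    return connected
```

Assuming `path` is the image path from the question, `preprocess(Image.open(path))` returns a cleaned image that can then be passed to `pytesseract.image_to_string` or saved and uploaded to another OCR tool. The threshold and radius will probably need tuning for other images.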

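Since I don't use tesseract myself, I couldn't try it on this image, but online OCR tools seem to be able to recognize the sequence correctly (especially if you use the blurred version).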

Answer 3

You may need to apply some image processing/enhancement to it first. Have a look at this post for suggestions and try applying them.

Answer 4

Similar to @SilverMonkey's suggestion: Gaussian blur followed by Otsu thresholding.
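A minimal sketch of that idea using only Pillow is shown below. The helper names `otsu_threshold` and `blur_and_binarize` and the blur radius of 2 are assumptions made for illustration, since the answer gives no parameters. Otsu's method picks the threshold that maximizes the between-class variance of the grayscale histogram, so no fixed value such as 127 has to be chosen by hand.

```python
from PIL import Image, ImageFilter


def otsu_threshold(gray):
    """Return the Otsu threshold t of an 8-bit grayscale image (pixels <= t are the dark class)."""
    hist = gray.histogram()  # 256 bin counts for mode "L"
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # pixel count of the dark class
    sum0 = 0  # intensity sum of the dark class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so the variance is undefined
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def blur_and_binarize(img, blur_radius=2):
    """Gaussian blur followed by Otsu thresholding (blur_radius is an assumed value)."""
    gray = img.convert("L")
    # Blurring merges the loose dots into continuous strokes before thresholding.
    blurred = gray.filter(ImageFilter.GaussianBlur(radius=blur_radius))
    t = otsu_threshold(blurred)
    return blurred.point(lambda p: 255 if p > t else 0)
```

With OpenCV installed, the same pipeline is usually written with `cv2.GaussianBlur` and `cv2.threshold` using the `cv2.THRESH_BINARY | cv2.THRESH_OTSU` flags. The Pillow version is used here only to keep the sketch close to the `Image.open(path)` call in the question.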
