tesseract-ocr not reading text from even simple images

Question

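So that others can find this on Google, I will explain my problem in detail, even though it should be obvious. I am using tesseract-ocr to extract text from images. The problem is that tesseract-ocr finds no text even in the simplest images. My system and version information: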

[root@tower python2]# uname -a
Linux tower.youds.com 2.6.32-504.12.2.el6.x86_64 #1 SMP Wed Mar 11 22:03:14 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
[root@tower python2]# tesseract -v
tesseract 3.02.02
leptonica-1.71
libjpeg 6b : libpng 1.2.52 : zlib 1.2.3

I am testing with the sample images from the php ocr class. That ocr class is not powerful enough for my needs, and tesseract apparently is.

This is what happens when I run tesseract:

[root@tower phpocr]# tesseract W1.png output.file
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
Empty page!!
Empty page!!
[root@tower phpocr]#

These are the images I am using:

http://arbiter.rogues-alliance.com/includes/phpocr/W.png

http://arbiter.rogues-alliance.com/includes/phpocr/W1.png

Edit: included more images.

Answer 1

Try adding a pagesegmode option such as -psm 10 (10 = Treat the image as a single character), which seems to improve recognition of single characters. Use tesseract --help to list the other options.
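As a minimal sketch of the suggestion above, the following Python helper builds and runs the command line with a page segmentation mode. The function names `build_tesseract_cmd` and `run_ocr` and the `legacy_flag` parameter are made up for illustration. It assumes `tesseract` is on the PATH and that the installed version accepts `-psm`, as 3.02 in the question does; later releases spell the flag `--psm`.

```python
import subprocess


def build_tesseract_cmd(image_path, output_base, psm=10, legacy_flag=True):
    """Build the argument list for a tesseract run with a page segmentation mode.

    Hypothetical helper, not part of tesseract itself.
    """
    # Tesseract 3.x (as in the question) spells the option "-psm";
    # newer releases use "--psm" instead.
    flag = "-psm" if legacy_flag else "--psm"
    return ["tesseract", image_path, output_base, flag, str(psm)]


def run_ocr(image_path, output_base, psm=10, legacy_flag=True):
    """Run tesseract and return the text it wrote to <output_base>.txt."""
    cmd = build_tesseract_cmd(image_path, output_base, psm, legacy_flag)
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(cmd, check=True)
    # tesseract appends ".txt" to the output base name by default.
    with open(output_base + ".txt", encoding="utf-8") as fh:
        return fh.read().strip()
```

Calling `run_ocr("W1.png", "output")` would run `tesseract W1.png output -psm 10` and read `output.txt`. Whether the character is recognized correctly still depends on the image itself.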

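Unfortunately, when I ran your sample files with -psm 10, W.png and W1.png were recognized as w and N respectively, although a larger image such as this one was correctly recognized as W. I suspect the size/font of your samples is causing this. Also, and this is pure speculation, tesseract might recognize this character better in context, that is, alongside other characters in the same font and size.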
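If image size is indeed the cause, one thing to try is enlarging the samples before running tesseract. The sketch below is an assumption-laden illustration, not something tested against these files: it relies on the third-party Pillow library, the helper names `upscaled_size` and `upscale_image` are hypothetical, and the factor of 4 is an arbitrary starting point.

```python
def upscaled_size(width, height, factor=4):
    """Return the (width, height) of an image enlarged by an integer factor."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return width * factor, height * factor


def upscale_image(src_path, dst_path, factor=4):
    """Write an enlarged copy of src_path to dst_path (requires Pillow)."""
    # Imported here so the size helper above works without Pillow installed.
    from PIL import Image

    with Image.open(src_path) as img:
        new_size = upscaled_size(img.width, img.height, factor)
        # LANCZOS resampling keeps glyph edges reasonably smooth when enlarging.
        img.resize(new_size, Image.LANCZOS).save(dst_path)
```

For example, `upscale_image("W1.png", "W1_big.png")` would produce a copy four times as wide and tall, which can then be passed to tesseract with -psm 10 to see whether recognition improves.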

php

ocr

tesseract