Inconsistent Pytesseract

Question

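I have a directory full of images that I want to extract values from.

I won't bother you with exactly how I locate the text in the original image. It's just a convolution function.

Here is a working example: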

The extracted text (this is actually a True/False numpy array, saved as an image with matplotlib's imsave(name,image,cmap='gray')):

If I now run

pytesseract.image_to_string(image2)

or

pytesseract.image_to_string(image2,config="--psm 7")

the result is '3 000 x', as expected.

Here is a failing example:

The extracted text (this is actually a True/False numpy array, saved as an image with matplotlib's imsave(name,image,cmap='gray')):

If I now run

pytesseract.image_to_string(image2)

or

pytesseract.image_to_string(image2,config="--psm 7")

the result is 'i imol els 4'.

It seems odd to me that there'd be such a big difference for such a similar process. Are there parameters to help pytesseract, e.g. the expected size of the characters, the format, etc.?

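PS: my current workaround is to use the convolution function to compare each image against a directory of examples I have already read by hand (my personal OCR is better than pytesseract, but slower!). That is good enough, but an extra level of automation would be nice!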

Answer 1

I inverted your image and then ran this command.

tesseract hluZr.png stdout -l eng --oem 3 --psm 6
1508 x
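
If you want to stay in Python, the same two steps (invert, then OCR with those flags) might look roughly like the sketch below. The helper names to_ocr_image and ocr_mask are made up for illustration. I'm also assuming that True in your array marks the text pixels, so that inverting gives dark text on a light background; if your array is the other way round, pass invert=False.

```python
import numpy as np


def to_ocr_image(mask, invert=True):
    """Convert a True/False array into an 8-bit grayscale array (0 or 255).

    Assumption: True marks text pixels, so invert=True yields dark text
    on a light background.
    """
    mask = np.asarray(mask, dtype=bool)
    if invert:
        mask = ~mask
    return mask.astype(np.uint8) * 255


def ocr_mask(mask, invert=True):
    """Run Tesseract on the mask with the same flags as the command above."""
    # Imported here so that to_ocr_image works without Tesseract installed.
    from PIL import Image
    import pytesseract

    image = Image.fromarray(to_ocr_image(mask, invert=invert))
    text = pytesseract.image_to_string(
        image, lang="eng", config="--oem 3 --psm 6"
    )
    return text.strip()
```

Calling ocr_mask(image) on the array directly also skips the round trip through imsave. I have only run the command-line version, so treat the Python version as untested against your images.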

