Detecting Bangla characters using pytesseract

Question

I am trying to detect Bangla characters in an image using Python, so I decided to use pytesseract. For that, I used the following code:

import pytesseract
from PIL import Image, ImageEnhance, ImageFilter

im = Image.open("input.png") # the second one
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save('temp2.png')
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
text = pytesseract.image_to_string(Image.open('temp2.png'),lang="ben")
print(text)

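The problem is that an image of English characters is recognized fine. But when I set lang="ben" and try to detect text in an image of Bangla characters, my code runs endlessly and never finishes.

P.S.: I have downloaded the Bengali trained data file into the tessdata folder, and I am running this in PyCharm.

Can anyone help me solve this problem?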

sample of input.png

Answer 1

I added the Bangla (India) language in Windows and downloaded ben.traineddata into TESSDATA_PREFIX, which on my PC is C:\Program Files\Tesseract 4.0.0\tessdata. Then I ran

> tesseract -l ben bangla.jpg bangla_out

at the command prompt and got the result after 2 seconds. Even though I don't know the language, the result looks good.
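If the command complains about the language instead, a first thing to check is whether ben.traineddata really sits in the tessdata directory. A minimal sketch, assuming Python 3; `has_traineddata` is a made-up helper name, and the directory is simply the one quoted above, so adjust it to your install:

```python
from pathlib import Path


def has_traineddata(tessdata_dir, lang="ben"):
    """Return True if <lang>.traineddata exists inside tessdata_dir."""
    return (Path(tessdata_dir) / (lang + ".traineddata")).is_file()


# Example directory taken from the answer text; yours may differ.
tessdata_dir = r"C:\Program Files\Tesseract 4.0.0\tessdata"
print("ben.traineddata found:", has_traineddata(tessdata_dir))
```

This only confirms that the file exists. It does not prove the file is valid or that it matches the installed Tesseract version.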

Have you tried running tesseract from the command prompt to verify that it works with -l ben?
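The same check can be scripted so that a hang shows up as a timeout rather than an endless wait. This is a rough sketch, assuming Python 3.7+ and that `tesseract` is on PATH (otherwise pass the full path as `exe`). The helper names and the 60-second limit are arbitrary choices for illustration:

```python
import subprocess


def build_command(image, out_base, lang="ben", exe="tesseract"):
    """Build the same command line as: tesseract -l ben bangla.jpg bangla_out"""
    return [exe, "-l", lang, image, out_base]


def parse_langs(listing):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in listing.splitlines() if line.strip()]
    # The first non-empty line is a header such as
    # "List of available languages (2):", so drop it.
    return lines[1:]


def run_ocr(image, out_base, lang="ben", exe="tesseract", timeout=60):
    """Run Tesseract; raises subprocess.TimeoutExpired if it takes too long."""
    listing = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, timeout=timeout
    )
    # Older releases print the list to stderr, so look at both streams.
    if lang not in parse_langs(listing.stdout + listing.stderr):
        raise RuntimeError("language %r is not installed" % lang)
    subprocess.run(
        build_command(image, out_base, lang, exe), check=True, timeout=timeout
    )
    # Tesseract appends .txt to the output base name.
    return out_base + ".txt"
```

Calling `run_ocr("bangla.jpg", "bangla_out")` should leave the text in bangla_out.txt, or fail quickly with a clear error. Recent pytesseract releases also accept a `timeout` argument on `image_to_string`, which raises `RuntimeError` when the limit is exceeded.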

EDIT:

I tested it in Spyder, an IDE similar to PyCharm that comes with Anaconda. I modified your code to call Tesseract as below.

pytesseract.pytesseract.tesseract_cmd = "C:/Program Files/Tesseract 4.0.0/tesseract.exe"

Test Code in Spyder:

import pytesseract
from PIL import Image, ImageEnhance, ImageFilter
import os

im = Image.open("bangla.jpg") # the second one
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save("bangla_pp.jpg")

pytesseract.pytesseract.tesseract_cmd = "C:/Program Files/Tesseract 4.0.0/tesseract.exe"
text = pytesseract.image_to_string(Image.open("bangla_pp.jpg"),lang="ben")
print(text)

It works and produced the result below on the processed image. Apparently, the OCR result for the processed image is not as good as for the original one.

Result from the processed bangla_pp.jpg:

   প্রত্যাবর্তনকারীরা
   তাঁদের দেশে গিয়ে

   -~~-<~~~~--

   প্রত্যাবর্তন-পরবর্তী
   আর্থিক সহায়তা
    = পাবেন তার

Result from the original image, fed directly to Tesseract:

Code:

from PIL import Image
import pytesseract as tess

print(tess.image_to_string(Image.open('bangla.jpg'), lang='ben'))

Output:

প্রত্যাবর্তনকারীরা
তাঁদের দেশে গিয়ে

প্রত্যাবর্তন-পরবর্তী
আর্থিক সহায়তা
পাবেন তার
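One likely reason the preprocessed image does worse: in Pillow, `im.convert('1')` applies Floyd-Steinberg dithering by default, which breaks smooth strokes into scattered dots. A possible alternative is a plain threshold, sketched here under the assumption that a fixed cutoff of 128 suits the image. `threshold_table` and `binarize` are illustrative names, not part of Pillow:

```python
def threshold_table(cutoff=128):
    """256-entry lookup table: 0 (black) below cutoff, 255 (white) otherwise."""
    return [0 if value < cutoff else 255 for value in range(256)]


def binarize(im, cutoff=128):
    """Convert a PIL image to 1-bit black and white without dithering."""
    # Image.point() accepts a lookup table and can emit mode '1'
    # when the source image is in mode 'L' (8-bit grayscale).
    return im.convert("L").point(threshold_table(cutoff), mode="1")
```

In the Spyder test code, `im = binarize(im)` would take the place of `im = im.convert('1')`. Whether it actually improves the OCR output depends on the image, so compare it against the unprocessed result.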

Answer 2

I installed some fonts in Windows from here:

https://www.omicronlab.com/bangla-fonts.html

After that, it worked very well for me in PyCharm.
