Tesseract detects only 4 words from an image

Question

I have some very simple Python code:

import cv2
import pytesseract

# Raw string, so that "\t" in the Windows path is not read as a tab character
pytesseract.pytesseract.tesseract_cmd = r'C:\Tesseract-OCR\tesseract.exe'
img = cv2.imread('1.png')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

hImg,wImg,_ = img.shape

#detecting words
boxes = pytesseract.image_to_data(img)
for x,b in enumerate(boxes.splitlines()):
    if x!=0:
        b = b.split()
        if len(b) == 12:
            x,y,w,h = int(b[6]), int(b[7]), int(b[8]), int(b[9])
            cv2.rectangle(img, (x,y), (w+x,h+y), (0,0,255), 3)


cv2.imshow('result', img)
cv2.waitKey(0)

But the result is odd: it detects only 4 words. What could be the reason?

Answer 1

You will get better OCR results if you improve the quality of the image you give to Tesseract.

While Tesseract 3.05 and older handle inverted images (light text on a dark background) without problems, for 4.x versions you should use dark text on a light background.
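
The polarity check can be sketched as follows. This is a minimal illustration, not part of Tesseract or OpenCV: the function name `ensure_dark_text_on_light` and the mean-brightness heuristic with a cutoff of 127 are assumptions, and the input is assumed to be a 2-D grayscale uint8 array.

```python
import numpy as np

def ensure_dark_text_on_light(gray, threshold=127):
    """Return the image with dark text on a light background.

    Heuristic (an assumption): if the average pixel is dark, the
    background is dark, so the image is inverted. `gray` is a 2-D
    uint8 array.
    """
    if gray.mean() < threshold:
        # Same effect as cv2.bitwise_not(gray) for uint8 images.
        return 255 - gray
    return gray

# Synthetic example: one light "text" pixel on a dark background.
dark_bg = np.zeros((4, 4), dtype=np.uint8)
dark_bg[1, 1] = 255
fixed = ensure_dark_text_on_light(dark_bg)
```

The heuristic can fail on images that are roughly half text, so it is worth checking the result visually before passing it to Tesseract.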

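Convert from BGR to HLS so that the background color can later be removed from the numbers in the upper half of the image. Then use cv2.inRange to create a "blue" mask and replace everything that is not "blue" with white.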

import numpy as np

# img is assumed to be the BGR image as returned by cv2.imread
hls=cv2.cvtColor(img,cv2.COLOR_BGR2HLS)

# Define lower and upper limits for the number colors.
blue_lo=np.array([114, 70, 70])
blue_hi=np.array([154, 225, 225])

# Mask image to only select "blue"
mask=cv2.inRange(hls,blue_lo,blue_hi)

# copy original image
img1 = img.copy()
img1[mask==0]=(255,255,255)

Help pytesseract by converting the image to black and white

Binarization converts an image to black and white. Tesseract does this internally (Otsu algorithm), but the result can be suboptimal, particularly if the page background is of uneven darkness.

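# img1 is a masked copy of the original BGR image (not HLS), so convert it straight to grayscale
gray = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
# With THRESH_OTSU the threshold is chosen automatically; the 0 passed here is ignored
_, img1 = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imshow('img_to_binary', img1)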

Use image_to_data on the img1 created above and continue with your existing code.

...
hImg,wImg,_ = img.shape

#detecting words
boxes = pytesseract.image_to_data(img1)
for x,b in enumerate(boxes.splitlines()):
    ...
...
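
The box-extraction loop can also be written against the column names instead of fixed positions. image_to_data returns tab-separated text whose first row is a header (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text). The sketch below is one possible way to read it; the helper name `parse_word_boxes` and the `min_conf` filter are illustrative additions, and the sample string is hand-written rather than real OCR output.

```python
def parse_word_boxes(tsv, min_conf=0.0):
    """Parse the TSV text returned by pytesseract.image_to_data.

    Returns a list of (x, y, w, h, text) tuples for rows that hold a
    non-empty word whose confidence is at least `min_conf`.
    """
    lines = tsv.splitlines()
    header = lines[0].split("\t")
    boxes = []
    for line in lines[1:]:
        row = dict(zip(header, line.split("\t")))
        text = row.get("text", "").strip()
        # Skip structural rows (page, block, line) and low-confidence words.
        if not text or float(row["conf"]) < min_conf:
            continue
        boxes.append((int(row["left"]), int(row["top"]),
                      int(row["width"]), int(row["height"]), text))
    return boxes

# Hand-written sample in the same column layout (not real OCR output).
sample = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num"
    "\tleft\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t200\t100\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t20\t30\t12\t96.5\tHello\n"
    "5\t1\t1\t1\t1\t2\t50\t20\t40\t12\t12.0\tw0rld\n"
)
boxes = parse_word_boxes(sample, min_conf=50)
```

Each returned tuple can be drawn with cv2.rectangle(img, (x, y), (x + w, y + h), ...) exactly as in the original loop. Counting the tuples also shows how many words Tesseract actually recognized, as opposed to how many rows survived the len(b) == 12 check.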
