tesseract detects only 4 words from image
I have very simple Python code:
import cv2
import pytesseract

# Use a raw string so the backslashes in the Windows path are not treated as escapes
pytesseract.pytesseract.tesseract_cmd = r'C:\Tesseract-OCR\tesseract.exe'

img = cv2.imread('1.png')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
hImg, wImg, _ = img.shape

# detecting words
boxes = pytesseract.image_to_data(img)
for x, b in enumerate(boxes.splitlines()):
    if x != 0:                 # skip the TSV header row
        b = b.split()
        if len(b) == 12:       # rows with 12 fields contain a recognised word
            x, y, w, h = int(b[6]), int(b[7]), int(b[8]), int(b[9])
            cv2.rectangle(img, (x, y), (w + x, h + y), (0, 0, 255), 3)
cv2.imshow('result', img)
cv2.waitKey(0)
But the result is odd: it detects only 4 words. What could be the reason?
If you improve the quality of the image you feed to Tesseract, you will get better OCR results.
While Tesseract version 3.05 (and older) handles inverted images (dark background and light text) without problems, for the 4.x versions use dark text on a light background.
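For example, if the source image really does have light text on a dark background, a minimal sketch of such preprocessing (assuming the question's '1.png' and plain OpenCV calls) would be to invert it before passing it to Tesseract:

import cv2
import pytesseract

# Sketch only: invert a light-text-on-dark-background image so Tesseract 4.x
# gets dark text on a light background. '1.png' is the file from the question.
img = cv2.imread('1.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
inverted = cv2.bitwise_not(gray)   # white text on black -> black text on white
print(pytesseract.image_to_string(inverted))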
Convert from BGR to HLS so that the background colour behind the numbers in the upper part of the image can be removed later. Then use cv2.inRange to create a "blue" mask and replace everything that is not "blue" with white.
import numpy as np

hls = cv2.cvtColor(img, cv2.COLOR_BGR2HLS)

# Define lower and upper limits for the number colors.
blue_lo = np.array([114, 70, 70])
blue_hi = np.array([154, 225, 225])

# Mask image to only select "blue"
mask = cv2.inRange(hls, blue_lo, blue_hi)

# Copy the original image and paint everything outside the mask white
img1 = img.copy()
img1[mask == 0] = (255, 255, 255)
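The blue_lo / blue_hi limits above are tuned to this particular image. If your text is a different shade, one way to derive such limits (a sketch only; the sampled pixel value below is made up, not taken from the question's image) is to convert a representative BGR pixel of the text to HLS and pad each channel:

import numpy as np
import cv2

# Sketch: turn one sampled BGR colour of the text into inRange limits.
sample_bgr = np.uint8([[[200, 80, 30]]])                 # one BGR pixel, a shade of blue
h, l, s = cv2.cvtColor(sample_bgr, cv2.COLOR_BGR2HLS)[0, 0]
blue_lo = np.array([max(int(h) - 20, 0),   max(int(l) - 60, 0),   max(int(s) - 60, 0)])
blue_hi = np.array([min(int(h) + 20, 179), min(int(l) + 60, 255), min(int(s) + 60, 255)])
print(blue_lo, blue_hi)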
Help pytesseract by converting the image to black and white.
This converts the image to black and white. Tesseract does this internally (Otsu algorithm), but the result can be suboptimal, particularly if the page background has uneven darkness.
# Convert the masked image to grayscale and binarize it with Otsu's method
rgb = cv2.cvtColor(img1, cv2.COLOR_HLS2RGB)
gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
_, img1 = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imshow('img_to_binary', img1)
Use image_to_data on the img1 created above and keep applying your existing code.
...
hImg, wImg, _ = img.shape

# detecting words
boxes = pytesseract.image_to_data(img1)
for x, b in enumerate(boxes.splitlines()):
...
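For completeness, here is a minimal sketch of that last step, assuming img1 is the binarized image from above; it uses pytesseract's Output.DICT structured output instead of splitting the TSV string by hand, which makes the len(b) == 12 check unnecessary:

import cv2
import pytesseract

# Minimal end-to-end sketch, assuming img1 is the binary image produced above.
data = pytesseract.image_to_data(img1, output_type=pytesseract.Output.DICT)
vis = cv2.cvtColor(img1, cv2.COLOR_GRAY2BGR)  # colour canvas to draw red boxes on
for i, word in enumerate(data['text']):
    if word.strip():                          # skip empty detections
        x, y = data['left'][i], data['top'][i]
        w, h = data['width'][i], data['height'][i]
        cv2.rectangle(vis, (x, y), (x + w, y + h), (0, 0, 255), 2)
cv2.imshow('result', vis)
cv2.waitKey(0)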