Unable to get a two-line name as a single word in pytesseract

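I'm trying to read text from an image using pytesseract. Here is the image:

My code can read the text, but it fails when a place name is split across two lines. For example, in the image, Grand Junction or Monterey Bay National Marine Sanctuary should each be recognized as a single entry, but their parts end up on separate lines.

Code: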

import cv2
import numpy as np
import pytesseract

act_image = cv2.imread('C:/Users/a463129/Downloads/chromedriver_win32/images/capture.png')
dimension = act_image.shape
image = act_image[0:dimension[0], 500:dimension[1]]
image = cv2.bitwise_not(image)
cv2.imshow("invert", image)
cv2.waitKey()

image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
kernel = np.ones((1, 1), np.uint8)
image = cv2.dilate(image, kernel, iterations=1)
image = cv2.erode(image, kernel, iterations=1)
image = cv2.GaussianBlur(image, (5, 5), 0)

img = image

img = cv2.resize(img,(0,0),fx=3,fy=3, interpolation=cv2.INTER_CUBIC)
img = cv2.medianBlur(img,5)
img = cv2.threshold(img,200,255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]


cv2.imshow('asd',cv2.resize(img,(0,0),fx=0.3,fy=0.3))
cv2.waitKey(0)
cv2.destroyAllWindows()

txt = pytesseract.image_to_string(img)

Output: Twin Falls, Medford m, Logan e, Sait Lake City a, Redding verna, NEVADA, Chico Reno, UTAH Grand, JUNCTION, Sacramento, San Francisco, San José NEVADA TEST, MONTEREY AND TRAINING, CALIFORNIA MANGE (MTT RI St George , BAY NATIONAL, MARINE Fresno, SANCTUARY, Las Vegas, Gallup, Kingman, Santa Barbara Lancaster, ARIZONA, Los Angeles paim Springs

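I'm new to Stack Overflow and this is my first time answering a question, so please forgive anything misleading or incorrect in this answer.

Since your image is noise-free, my idea was to extract each complete place name by passing only that part of the image (a cropped region) to tesseract. To do that, I applied morphological operations to the image to segment it into text blocks and took the bounding-box coordinates of the resulting contours. I then cropped the Otsu-thresholded image at those coordinates and passed each crop to tesseract.

Here is the complete code in Python: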

import cv2
import pytesseract
import numpy as np
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'

image = cv2.imread("act.png")
gray = cv2.cvtColor(image,cv2.COLOR_BGR2GRAY)
otsu = ~(cv2.threshold(gray,0,255,cv2.THRESH_BINARY+cv2.THRESH_OTSU)[1])
erode_otsu = cv2.erode(otsu,np.ones((7,7),np.uint8),iterations=1)
negated_erode = ~erode_otsu
dilated = cv2.dilate(negated_erode,np.ones((3,3),np.uint8),iterations=4)

contours_otsu,_ = cv2.findContours(dilated,cv2.RETR_TREE,cv2.CHAIN_APPROX_SIMPLE)
texts = []

for cnt in contours_otsu:
    x,y,w,h = cv2.boundingRect(cnt)
    mask = otsu[y:y+h,x:x+w]
    custom_oem_psm_config =  r'--oem 3 --psm 3'
    text = pytesseract.image_to_string(mask,lang='eng',config=custom_oem_psm_config)
    print(text)
    texts.append(text)

print(texts)
cv2.imwrite("dilated.jpg",dilated)

Output: ['', 'Palm Springs', 'Los Angeles', 'ARIZONA', 'Santa Barbara', 'Lancaster', 'Flagstaff', 'Kingman', 'Gallup', 'Las Vegas', 'Fresno', 'CALIFORNIA', 'St. George', 'MONTEREY\nBAY NATIONAL\nMARINE\nSANCTUARY', '', '', 'NEVADA TEST\nAND TRAINING\nRANGE (NTTR)', 'San José', 'San Francisco', 'Sacramento', 'Grand\nJunction', 'UTAH', 'Reno', 'Chico', 'NEVADA', 'Vernal', 'Redding', 'Salt Lake City', 'Eureka', '', '', 'Logan', '', 'Medford', 'Twin Falls']
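
The multi-line entries in the list above still carry embedded newlines (for example 'Grand\nJunction'), and some contours produced empty strings. A minimal post-processing sketch, assuming the `texts` list has the shape shown above; the helper name `clean_block_texts` is made up for illustration:

```python
def clean_block_texts(texts):
    """Collapse each OCR block into one line and drop empty results."""
    cleaned = []
    for text in texts:
        # split() with no arguments breaks on any whitespace, including '\n',
        # so joining with a single space merges wrapped lines into one name.
        name = " ".join(text.split())
        if name:
            cleaned.append(name)
    return cleaned


sample = ['', 'Palm Springs', 'Grand\nJunction',
          'MONTEREY\nBAY NATIONAL\nMARINE\nSANCTUARY']
print(clean_block_texts(sample))
# ['Palm Springs', 'Grand Junction', 'MONTEREY BAY NATIONAL MARINE SANCTUARY']
```

Calling `clean_block_texts(texts)` after the loop would give one string per place name.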

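There you go: the text, segmented by block. I'm assuming you have no time constraint here, because the code takes a long time to run since image_to_string is called inside the loop. You could also look at the image_to_data function, and you could try cleaning up the output text or filtering on the confidence values instead. Thanks.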
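
As a rough sketch of the image_to_data idea: when called with `output_type=pytesseract.Output.DICT`, it returns parallel lists that include 'text', 'conf' and 'block_num', so a single call on the whole image can be grouped by block and filtered on confidence instead of running OCR once per contour. The helper name `group_words_by_block` and the threshold of 60 are assumptions for illustration, and whether Tesseract's own block segmentation keeps 'Grand' and 'Junction' together on this particular map is untested.

```python
from collections import defaultdict


def group_words_by_block(data, min_conf=60.0):
    """Join words that share a Tesseract block number into one string.

    `data` is expected to have the shape returned by
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    """
    blocks = defaultdict(list)
    for text, conf, block in zip(data["text"], data["conf"], data["block_num"]):
        word = text.strip()
        # Non-word rows (page, block, line) carry conf -1 and empty text;
        # conf may be str, int or float depending on the pytesseract version.
        if word and float(conf) >= min_conf:
            blocks[block].append(word)
    return [" ".join(words) for _, words in sorted(blocks.items())]


# Hand-made stand-in for real image_to_data output (not real OCR results).
fake = {
    "text": ["", "Grand", "Junction", "", "UTAH", "x"],
    "conf": ["-1", "96", "91", "-1", "95", "12"],
    "block_num": [1, 1, 1, 2, 2, 2],
}
print(group_words_by_block(fake))
# ['Grand Junction', 'UTAH']
```

With a real image this would be called as `group_words_by_block(pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT))`, where `img` is the preprocessed image.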