How can I get text from this image with Tesseract?
Currently I am using the code below to get text from images, and it works fine, but it does not work for these two images; it seems Tesseract cannot scan these kinds of images. Please tell me how to fix it.
https://i.ibb.co/zNkbhKG/Untitled1.jpg
https://i.ibb.co/XVbjc3s/Untitled3.jpg
def read_screen():
    spinner = Halo(text='Reading screen', spinner='bouncingBar')
    spinner.start()
    screenshot_file = "Screens/to_ocr.png"
    screen_grab(screenshot_file)

    # prepare argparse
    ap = argparse.ArgumentParser(description='HQ_Bot')
    ap.add_argument("-i", "--image", required=False, default=screenshot_file,
                    help="path to input image to be OCR'd")
    ap.add_argument("-p", "--preprocess", type=str, default="thresh",
                    help="type of preprocessing to be done")
    args = vars(ap.parse_args())

    # load the image and convert it to grayscale
    image = cv2.imread(args["image"])
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    if args["preprocess"] == "thresh":
        gray = cv2.threshold(gray, 177, 177,
                             cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
    elif args["preprocess"] == "blur":
        gray = cv2.medianBlur(gray, 3)

    # store grayscale image as a temp file to apply OCR
    filename = "Screens/{}.png".format(os.getpid())
    cv2.imwrite(filename, gray)

    # load the image as a PIL/Pillow image, apply OCR, and then delete the temporary file
    # (raw string so the backslashes in the Windows path are not treated as escapes)
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
    # ENG
    # text = pytesseract.image_to_string(Image.open(filename))
    # VIET
    text = pytesseract.image_to_string(Image.open(filename), lang='vie')
    os.remove(filename)
    os.remove(screenshot_file)

    # show the output images
    '''cv2.imshow("Image", image)
    cv2.imshow("Output", gray)
    os.remove(screenshot_file)
    if cv2.waitKey(0):
        cv2.destroyAllWindows()
    print(text)
    '''
    spinner.succeed()
    spinner.stop()
    return text
You should try different psm modes instead of the default one, like this:
target = pytesseract.image_to_string(im,config='--psm 4',lang='vie')
Extract from the docs:
Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line,
bypassing hacks that are Tesseract-specific.
For example, for /Untitled3.jpg you could try --psm 4, and if that fails you could try --psm 11 for both images.
Depending on your Tesseract version, you can also try different oem modes:
Use --oem 1 for LSTM, --oem 0 for Legacy Tesseract. Please note that Legacy Tesseract models are only included in traineddata files from tessdata repo.
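Both switches can be passed in the same config string. A small sketch (reusing the im variable from above; the particular values are just an example):
# --oem 1 selects the LSTM engine, --psm 4 assumes a single column of text
text = pytesseract.image_to_string(im, config='--oem 1 --psm 4', lang='vie')
print(text)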
EDIT
Also, as can be seen in your images, there are two languages present. If you want to use the lang parameter, you need to split the image into two parts manually (so you do not confuse the Tesseract engine) and use a different lang value for each part, as sketched below.
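A rough sketch of that manual split, assuming the Vietnamese question sits in the top half of the screenshot and the English answers in the bottom half; the crop boxes are placeholders you would need to adjust to your actual layout:
from PIL import Image
import pytesseract

img = Image.open("Screens/to_ocr.png")
w, h = img.size

# Placeholder regions -- adjust to where each language actually appears
question_part = img.crop((0, 0, w, h // 2))   # assumed Vietnamese region
answers_part = img.crop((0, h // 2, w, h))    # assumed English region

viet_text = pytesseract.image_to_string(question_part, lang='vie', config='--psm 4')
eng_text = pytesseract.image_to_string(answers_part, lang='eng', config='--psm 4')
print(viet_text)
print(eng_text)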
EDIT 2
Below is a complete working example for Untitled3. I noticed that your use of thresholding is incorrect: maxval should be set to a value greater than the threshold. In my example I set the threshold to 177 but maxval to 255, so everything above 177 becomes white (255) and everything at or below it becomes black. I did not even need any additional preprocessing (such as Otsu binarization) beyond that simple threshold.
import cv2
import pytesseract

# load the image and convert it to grayscale
image = cv2.imread("./Untitled3.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# binary threshold: pixels above 177 become 255 (white), the rest become 0 (black)
_, image = cv2.threshold(image, 177, 255, cv2.THRESH_BINARY)

# preview the preprocessed image
cv2.namedWindow("TEST")
cv2.imshow("TEST", image)
cv2.waitKey()

text = pytesseract.image_to_string(image, lang='eng')
print(text)
Output:
New York, New York
Salzburg, Austria
Hollywood, California