Pytesseract Improve OCR Accuracy
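I want to extract text from an image in Python. For this, I chose pytesseract. When I tried to extract text from the image, the results were not satisfactory. I also went through this and implemented all the techniques listed below. However, it does not seem to perform well.
Image:
Code: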
import pytesseract
import cv2
import numpy as np
# Raw strings keep the backslashes in Windows paths from being read as escapes
img = cv2.imread(r'D:\wordsimg.png')
img = cv2.resize(img, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_CUBIC)
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
kernel = np.ones((1,1), np.uint8)
img = cv2.dilate(img, kernel, iterations=1)
img = cv2.erode(img, kernel, iterations=1)
img = cv2.threshold(cv2.medianBlur(img, 3), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
txt = pytesseract.image_to_string(img, lang='eng')
txt = txt[:-1]
txt = txt.replace('\n', ' ')
print(txt)
Output:
t hose he large form might light another us should took mountai house n story important went own own thought girl over family look some much ask the under why miss point make mile grow do own school was
Even a single unwanted space could cost me a lot. I want the results to be 100% accurate. Any help would be appreciated. Thanks!
I changed the resize factor from 1.2 to 2 and removed all of the preprocessing. I got good results with psm 11 and psm 12.
import pytesseract
import cv2
import numpy as np
img = cv2.imread('wavy.png')
# img = cv2.resize(img, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_CUBIC)
img = cv2.resize(img, None, fx=2, fy=2)
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
kernel = np.ones((1,1), np.uint8)
# img = cv2.dilate(img, kernel, iterations=1)
# img = cv2.erode(img, kernel, iterations=1)
# img = cv2.threshold(cv2.medianBlur(img, 3), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
cv2.imwrite('thresh.png', img)
# Raw string so that \t in the Windows path is not read as a tab character
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe'
# Try page segmentation modes 6 through 13 and print each result
for psm in range(6, 13+1):
    config = '--oem 3 --psm %d' % psm
    txt = pytesseract.image_to_string(img, config=config, lang='eng')
    print('psm ', psm, ':', txt)
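The config = '--oem 3 --psm %d' % psm line uses the string interpolation (%) operator to replace %d with an integer (psm). I'm not really sure what oem does, but I've gotten into the habit of using it. There is more about psm at the end of this answer.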
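As a small sketch of how the config string is put together, the helper below builds the same string with an f-string and rejects values outside the documented ranges. The name build_config is hypothetical and not part of pytesseract. As far as I know, tesseract --help-oem lists the engine modes in Tesseract 4 and later: 0 for the legacy engine only, 1 for the neural-net LSTM engine only, 2 for legacy plus LSTM, and 3 for the default based on what is available.

```python
def build_config(psm: int, oem: int = 3) -> str:
    """Return a Tesseract config string such as '--oem 3 --psm 11'.

    build_config is a hypothetical helper for illustration only.
    """
    # Page segmentation modes run from 0 to 13 (see tesseract --help-psm)
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    # OCR engine modes run from 0 to 3 (assumption: Tesseract 4 or later)
    if not 0 <= oem <= 3:
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    return f"--oem {oem} --psm {psm}"


# Same result as the % operator used in the loop above
print(build_config(11))
```

The returned string can be passed as the config argument of pytesseract.image_to_string.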
psm 11 : those he large form might light another us should name
took mountain story important went own own thought girl
over family look some much ask the under why miss point
make mile grow do own school was
psm 12 : those he large form might light another us should name
took mountain story important went own own thought girl
over family look some much ask the under why miss point
make mile grow do own school was
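Since a single stray space matters here, it can help to check an OCR result against a known transcription word by word instead of by eye. The following is only a sketch using difflib from the standard library; word_diff is a hypothetical helper, and the two strings are the opening words of the outputs shown in this thread, standing in for a real reference text.

```python
import difflib


def word_diff(expected: str, actual: str):
    """Return (tag, expected_words, actual_words) for each span that differs."""
    exp, act = expected.split(), actual.split()
    matcher = difflib.SequenceMatcher(a=exp, b=act, autojunk=False)
    return [
        (tag, exp[i1:i2], act[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only insertions, deletions and replacements
    ]


# Assumption: 'expected' would normally be a hand-checked transcription
expected = "those he large form might"
actual = "t hose he large form might"
for tag, exp_words, act_words in word_diff(expected, actual):
    print(tag, exp_words, "->", act_words)
```

An empty list means the two texts match word for word once whitespace is ignored.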
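psm is short for page segmentation mode. I'm not sure what all the different modes are, but you can get a sense of what each code means from its description. You can get the list from tesseract --help-psm: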
Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR. (not implemented)
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line,
bypassing hacks that are Tesseract-specific.
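When several modes are tried in a loop, it can be easier to see which ones agree by grouping the modes by the text they returned, the way psm 11 and psm 12 agree above. This is a rough sketch: normalize, group_modes_by_output and ocr_all_modes are hypothetical names, and pytesseract is imported lazily so the grouping logic can be tried without Tesseract installed.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Collapse newlines and runs of whitespace into single spaces."""
    return " ".join(text.split())


def group_modes_by_output(results):
    """Map each distinct normalized text to the list of psm values that produced it."""
    groups = defaultdict(list)
    for psm, text in sorted(results.items()):
        groups[normalize(text)].append(psm)
    return dict(groups)


def ocr_all_modes(img, modes=range(6, 14)):
    """Run pytesseract once per mode and return {psm: text}."""
    import pytesseract  # lazy import: only needed when OCR is actually run

    return {
        psm: pytesseract.image_to_string(img, lang="eng", config=f"--oem 3 --psm {psm}")
        for psm in modes
    }


# Grouping demo on hand-written strings, so no image is required
sample = {6: "t hose he large", 11: "those he\nlarge\n", 12: "those he large"}
for text, modes in group_modes_by_output(sample).items():
    print(modes, ":", text)
```

With a real image, group_modes_by_output(ocr_all_modes(img)) would show at a glance which modes produced the same text.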