How to extract words from left to right in an image with OpenCV Pytesseract?

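I am working on a contract sheet with OpenCV and pytesseract. I want to extract the words from this image.

I am trying getStructuringElement, but my code jumps to the next line at the center of the image. I want to extract the words from the left side of the image first, collect all the strings from the left side, and then move on to the right side of the image.

The code is: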

import cv2
import pytesseract
from PIL import Image

image = cv2.imread("report_name-1.jpg")

# preprocessing
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # grayscale
_, thresh = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY_INV)  # threshold
kernel = cv2.getStructuringElement(cv2.MORPH_CROSS, (3, 3))
dilated = cv2.dilate(thresh, kernel, iterations=13)  # dilate
contours, hierarchy = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)  # get contours

# for each contour found, draw a rectangle around it on the original image
for contour in contours:
    # get rectangle bounding contour
    x, y, w, h = cv2.boundingRect(contour)
    # discard areas that are too large
    if h > 300 and w > 300:
        continue
    # discard areas that are too small
    if h < 40 or w < 40:
        continue
    # draw rectangle around contour on original image
    cv2.rectangle(image, (x, y), (x + w, y + h), (255, 0, 255), 2)

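You can extract text from left to right and top to bottom with --psm 6, which tells Pytesseract to assume a single uniform block of text. Preprocessing also matters, so we threshold the image to get a binary image with black foreground text on a white background. Other Pytesseract configuration options are available as well. After thresholding, this is the image we pass to Pytesseract:

[image]

Here is the output: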

Limit Balance
Sep 29, 2015 ,750.0 Oct 01, 2018 $0.00 Oct 02, 2018
0
Account Condition: Paid account/zero Account #: Delinquency 30 Days = $0.00 | 60 Days =$0.00 90+ Days =$0.00 | Derog =00
balance 4636676005495602 Counter (Past
seven years)
Payment Status: This is an account in good Responsibility: Individual
standing
Account Type: Credit Card Account Term: REV
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2016 0 0 0
2017 0 0 0 0 0 0 0 0 0 0 0 0
2018 0 0 0 0 0 0 0 0 0 B
> BMW FINANCIAL SERVICES /
2602980
Open Date Original Amount Credit Status Date Chargeoff Amount Past Due Last Paid Date Balance Date Current
Limit Balance
Sep 19, 2015 ,189.00 Jul01, 2017 $0.00 Jul 21, 2017 Jul 24, 2017
Account Condition: Paid account/zero Account #: 4002206279 Delinquency 30 Days = $0.00 | 60 Days =$0.00 90+ Days =$0.00 | Derog =00
balance Counter (Past
seven years)
Payment Status: This is an account in good Responsibility: Individual
standing
Account Type: Auto Lease Account Term: 036
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2015 Cc Cc Cc Cc
2016 Cc Cc Cc Cc Cc Cc Cc Cc Cc Cc Cc Cc
2017 Cc Cc Cc Cc Cc Cc B
> LEXUS FINANCIAL SERVIC /
1624210
Open Date Original Amount Credit Status Date Chargeoff Amount Past Due Last Paid Date Balance Date Current
Limit Balance
Mar 07, 2015 ,342.00 Jul01, 2016 $0.00 Jul 05, 2016 Jul 31, 2016
Account Condition: Paid account/zero Account #: Delinquency 30 Days = $0.00 | 60 Days =$0.00 90+ Days =$0.00 | Derog =00
balance 70403662535410001 Counter (Past
seven years)
Payment Status: This is an account in good Responsibility: Individual
standing
Account Type: Auto Loan Account Term: 072
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2014
2015 Cc Cc Cc Cc Cc Cc Cc Cc Cc Cc
2016 Cc Cc Cc Cc Cc Cc B
> AES/SUNTRUST BANK / 9997195
Open Date Original Amount Credit Status Date Chargeoff Amount Past Due Last Paid Date Balance Date Current
Limit Balance
Sep 19, 2008 ,500.00 Apr 01, 2016 $0.00 Apr 21, 2016 Apr 30, 2016
Account Condition: Paid account/zero Account #: Delinquency 30 Days = $0.00 | 60 Days =$0.00 90+ Days =$0.00 | Derog =00
balance 5046237209PA00001 Counter (Past
seven years)
Payment Status: This is an account in good Responsibility: Signer
standing
Account Type: Education Loan Account Term: 300
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2014 Cc Cc Cc Cc Cc Cc Cc Cc Cc
2015 Cc Cc Cc Cc Cc Cc Cc Cc Cc Cc Cc Cc
2016 Cc Cc Cc B
> BARCLAYS BANK DELAWARE /
1223850
Open Date Original Amount Credit Status Date Chargeoff Amount Past Due Last Paid Date Balance Date Current
Limit Balance
Apr 04, 2013 ,500.00 Apr 01, 2016 $0.00 Oct 06, 2014 Apr 05, 2016
Account Condition: Paid account/zero Account #: 000176863399109 Delinquency 30 Days = $0.00 | 60 Days =$0.00 90+ Days =$0.00 | Derog =00
balance Counter (Past
seven years)
Payment Status: This is an account in good Responsibility: Individual
standing
Account Type: Credit Card Account Term: REV
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2014 Cc Cc Cc Cc Cc Cc Cc Cc 0
2015 0 0 0 0 0 0 0 0 0 0 0 0
2016 0 0 0 B
> AMERICAN HONDA FINANCE /
1605190
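
Code:

import cv2
import pytesseract

# Path to the Tesseract executable (Windows install location; adjust for your system)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load the image, convert to grayscale, and apply Otsu's threshold
image = cv2.imread('1.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# OCR with --psm 6: treat the image as a single uniform block of text
data = pytesseract.image_to_string(thresh, lang='eng', config='--psm 6')
print(data)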
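
With --psm 6 each output line runs across the full page width, so left-hand and right-hand fields end up interleaved. If you would rather read everything on the left side first and then everything on the right, as the question describes, one possible approach is to ask Pytesseract for word boxes with image_to_data and group them yourself. The following is only a rough sketch under some assumptions: words_by_column, ocr_left_then_right, split_x and min_conf are hypothetical names introduced here, the page is assumed to divide into two columns at its horizontal midpoint, and the sample dictionary is made-up data that mimics the shape of image_to_data output rather than real OCR results.

```python
def words_by_column(data, split_x, min_conf=0):
    """Split word boxes into (left_text, right_text).

    `data` is assumed to look like the dict returned by
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    `split_x` is a hypothetical pixel column separating the two sides.
    """
    columns = {"left": {}, "right": {}}
    for i, word in enumerate(data["text"]):
        # Skip empty entries and boxes below the confidence cutoff
        # (non-word rows are reported with a confidence of -1).
        if not word.strip() or float(data["conf"][i]) < min_conf:
            continue
        centre_x = data["left"][i] + data["width"][i] / 2
        side = "left" if centre_x < split_x else "right"
        # Reuse the block/paragraph/line numbers Tesseract already assigned.
        line_key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        columns[side].setdefault(line_key, []).append(
            (data["top"][i], data["left"][i], word)
        )

    def render(lines):
        # Lines go top to bottom; words inside a line go left to right.
        ordered = sorted(lines.values(), key=lambda ws: min(w[0] for w in ws))
        return "\n".join(
            " ".join(w[2] for w in sorted(ws, key=lambda w: w[1])) for ws in ordered
        )

    return render(columns["left"]), render(columns["right"])


def ocr_left_then_right(image_path):
    # Imported here so words_by_column can be tried without OpenCV installed.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    data = pytesseract.image_to_data(
        thresh, lang="eng", config="--psm 6", output_type=pytesseract.Output.DICT
    )
    # Assumption: the two columns meet at the horizontal midpoint of the page.
    return words_by_column(data, split_x=image.shape[1] / 2)


# Made-up word boxes, only to show the grouping without running Tesseract.
sample = {
    "text": ["", "Account", "Type:", "Account", "Term:", "Payment", "Status:"],
    "conf": [-1, 96, 95, 94, 93, 92, 91],
    "left": [0, 10, 90, 400, 480, 10, 95],
    "top": [0, 20, 20, 21, 21, 50, 50],
    "width": [600, 70, 50, 70, 50, 75, 60],
    "block_num": [1, 1, 1, 1, 1, 1, 1],
    "par_num": [1, 1, 1, 1, 1, 1, 1],
    "line_num": [0, 1, 1, 1, 1, 2, 2],
}
left, right = words_by_column(sample, split_x=300)
print(left)
print(right)
```

On a real report the midpoint is probably not the right boundary (the payment-history grid, for example, spans the full width), so you would likely need to choose split_x per section or crop each section before running OCR.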