Tesseract doesn't recognize individual text segments after whitelisting

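I have an image that I want to extract text from using tesseract and Python. I only want to recognize a specific set of characters, so I pass tessedit_char_whitelist=1234567890CBDE as config. But now tesseract no longer seems to recognize the gaps between the text segments. Is there a character I can add to the whitelist so that it recognizes the segments as separate pieces of text again?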
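For reference, here is a minimal sketch of how such a whitelist is usually combined with the other options into one config string. The helper name `build_config` is hypothetical and not part of pytesseract; only `--oem`, `--psm` and `-c tessedit_char_whitelist=` are Tesseract's own options. It only shows how the string is assembled and does not by itself change how the segments are split.

```python
def build_config(whitelist=None, psm=6, oem=3):
    """Assemble a Tesseract config string (hypothetical helper, not a pytesseract API)."""
    parts = [f"--oem {oem}", f"--psm {psm}"]
    if whitelist:
        # -c sets a Tesseract config variable; here, the allowed character set
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


custom_config = build_config("1234567890CBDE")
# The string would then be passed as config=custom_config to
# pytesseract.image_to_data(...) or pytesseract.image_to_string(...).
```

Calling `build_config()` without a whitelist gives the plain `--oem 3 --psm 6` string, which makes it easy to compare both variants on the same image.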

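Here is the image after whitelisting:

Here is the image before whitelisting:

Here is the code that draws the boxes and recognizes the characters, in case you are curious: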


import cv2
import pytesseract
from pytesseract import Output

# threshold_img is the preprocessed image created earlier (not shown here)

# Configure parameters for tesseract
# whitelist = "-c tessedit_char_whitelist=1234567890CBDE"
custom_config = r'--oem 3 --psm 6'
# Feed the image to tesseract
details = pytesseract.image_to_data(threshold_img, output_type=Output.DICT, config=custom_config, lang='eng')
print(details.keys())

# Minimum confidence (in percent) a box needs in order to be drawn
CONFIDENCE = 0
total_boxes = len(details['text'])
for sequence_number in range(total_boxes):
    # Depending on the version, conf may be an int, a float or a numeric string; float() handles all of them
    if float(details['conf'][sequence_number]) >= CONFIDENCE:
        (x, y, w, h) = (details['left'][sequence_number], details['top'][sequence_number], details['width'][sequence_number], details['height'][sequence_number])
        threshold_img = cv2.rectangle(threshold_img, (x, y), (x + w, y + h), (0, 255, 0), 2)
# Display the image
cv2.imshow('captured text', threshold_img)
cv2.imwrite("before.png", threshold_img)
# Keep the output window open until the user presses a key
cv2.waitKey(0)
# Close all open windows
cv2.destroyAllWindows()

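Edit:

This is the original image I want to extract the text from, with the goal of writing it into a matrix:

The desired matrix would have the following form: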


content = [
    ["1C", "55", "55", "E9", "BD"],
    # ...
    ["1C", "1C", "55", "BD", "BD"]
]
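One possible way to get from the `image_to_data` output to a matrix of this shape is to group the recognized words by line and sort them left to right. The sketch below assumes Tesseract has already split the words correctly and relies on the `text`, `conf`, `block_num`, `par_num`, `line_num` and `left` keys of the `Output.DICT` result. The function name `rows_from_details` is made up for illustration.

```python
from collections import defaultdict


def rows_from_details(details, min_conf=0.0):
    """Group recognized words into rows, each ordered left to right."""
    lines = defaultdict(list)
    for i, text in enumerate(details["text"]):
        word = text.strip()
        # Skip empty entries (page/block/line records) and low-confidence words
        if not word or float(details["conf"][i]) < min_conf:
            continue
        key = (details["block_num"][i], details["par_num"][i], details["line_num"][i])
        lines[key].append((details["left"][i], word))
    # Sort lines top-down by their (block, paragraph, line) key, words by x position
    return [[w for _, w in sorted(words)] for _, words in sorted(lines.items())]
```

With the `details` dictionary from the code above, `content = rows_from_details(details)` would give a list of rows. It is only as good as the segmentation, so merged segments would still show up as a single entry.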

One solution is:


    1. Take each tuple individually and upsample it by a factor of 2
    2. Apply a threshold
    3. Recognize the text with page-segmentation-mode set to 6 (--psm 6)

| Tuple   | Threshold | Result         |
| (image) | (image)   | 1C 55 55 E9 BO |
| (image) | (image)   | 1C 1C 55 BO 1C |
| (image) | (image)   | 1C 55 BO 55 IC |
| (image) | (image)   | 1C BD 50 1C 1C |
| (image) | (image)   | 1C 1C 55 BD BD |

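The idea is to take each tuple individually, upsample it, and then apply an inverse binary threshold. Because of the font, Tesseract misreads some tuples; for example, the character D looks like an O. If you want 100% accuracy, I suggest you train Tesseract. Also, make sure you try with other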

Here is the array output:

[['1C', '55', '55', 'E9', 'BO'], ['1C', '1C', '55', 'BO', '1C'], ['1C', '55', 'BO', '55', 'IC'], ['1C', 'BD', '50', '1C', '1C'], ['1C', '1C', '55', 'BD', 'BD']]
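Short of training Tesseract, some of the misreads visible in this output (BO for BD, IC for 1C) could be patched in a post-processing step, since the valid character set is known. The sketch below is an illustration only: the `CONFUSIONS` mapping is an assumption based on the errors seen here, and mapping O to D is a guess, because an O could just as well be a misread 0. It also cannot detect a wrong but valid token such as 50.

```python
# Assumed look-alike substitutions, taken from the errors in the output above
CONFUSIONS = {"O": "D", "I": "1"}
# Characters that may legitimately appear in a tuple
ALLOWED = set("1234567890CBDE")


def correct_token(token):
    """Replace known look-alikes; return None if the result is still invalid."""
    fixed = "".join(CONFUSIONS.get(ch, ch) for ch in token.upper())
    return fixed if fixed and set(fixed) <= ALLOWED else None


res = [["1C", "55", "55", "E9", "BO"], ["1C", "55", "BO", "55", "IC"]]
corrected = [[correct_token(t) for t in row] for row in res]
```

Tokens that come back as `None` still contain characters outside the allowed set and would need a manual check or a second OCR pass.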

Code:


import cv2
import pytesseract

img = cv2.imread("IVemF.png")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
(h, w) = gry.shape[:2]
cell_h = h // 5  # height of one grid cell
cell_w = w // 5  # width of one grid cell
cfg = "--psm 6"
res = []

for i in range(5):
    row = []
    for j in range(5):
        # Crop one tuple (grid cell) from the grayscale image
        crp = gry[i * cell_h:(i + 1) * cell_h, j * cell_w:(j + 1) * cell_w]
        (h_crp, w_crp) = crp.shape[:2]
        # Upsample by a factor of 2
        crp = cv2.resize(crp, (w_crp * 2, h_crp * 2))
        # Inverse binary threshold, with the level chosen by Otsu's method
        thr = cv2.threshold(crp, 0, 255,
                            cv2.THRESH_BINARY_INV |
                            cv2.THRESH_OTSU)[1]
        txt = pytesseract.image_to_string(thr,
                                          config=cfg)
        # Remove the trailing newline and form feed that Tesseract appends
        txt = txt.strip().upper()
        row.append(txt)
        print(txt)
        # Show the thresholded tuple; press any key to move on to the next one
        cv2.imshow("thr", thr)
        cv2.waitKey(0)
    res.append(row)

print(res)
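One detail worth noting in the code above: it uses a fixed step of `int(h/5)` and `int(w/5)`, so if the image size is not a multiple of 5, a few pixels at the bottom and right edges are never cropped. A small sketch of an alternative way to compute the cell boundaries so that the whole image is covered (the helper name `cell_bounds` is hypothetical):

```python
def cell_bounds(height, width, rows=5, cols=5):
    """Yield (row, col, y0, y1, x0, x1) for a grid that covers the whole image."""
    for r in range(rows):
        # Integer arithmetic spreads any remainder pixels across the cells
        y0, y1 = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            yield r, c, y0, y1, x0, x1


# Example with an image of 103 x 52 pixels (height x width)
bounds = list(cell_bounds(103, 52))
```

Each cell could then be cropped as `gry[y0:y1, x0:x1]` before upsampling and thresholding as shown in the answer.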