Pytesseract not detecting a digit which might be a picture within a picture

Question

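I am trying to extract the digit from the image strip shown below.

I have no problem extracting digits from normal text, but the digit in the strip above appears to be a picture within a picture. This is the code I am using to extract the digit.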

import pytesseract
from PIL import Image

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
img = Image.open(r"C:\Users\UserName\PycharmProjects\COLLEGE PROJ.png")
text = pytesseract.image_to_string(img, config='--psm 6')
with open("c.txt", 'w') as file:
    file.write(text)
print(text)

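I have tried every psm value from 1 to 13, and all of them output only WEEK. The code works if I crop out just the digit, but my project requires extracting it from strips like this one. Can someone help me? I have been stuck on this part of the project for a while.

I have attached the full image in case it helps anyone understand the problem better.

I can extract the digits from the text on the right, but I cannot extract them from the week strip on the far left!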

Answer 1

First, you need to apply adaptive thresholding and a bitwise-not operation to the image.

After adaptive thresholding:

After bitwise-not:

To learn more about these operations, see the OpenCV tutorials on Morphological Transformations, Arithmetic Operations, and Image Thresholding.
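To make the two operations concrete, here is a minimal pure-Python sketch of mean adaptive thresholding and bitwise-not on a tiny grayscale grid. It is an illustration only, not OpenCV's implementation: the names `adaptive_mean_threshold` and `bitwise_not`, the sample `strip`, and the border handling (windows are simply clipped at the image edges) are assumptions made for this example. In this sketch, inverting the inverse-binary result gives the same image as thresholding in plain binary mode.

```python
def adaptive_mean_threshold(gray, block=3, c=2, invert=False):
    """Simplified mean adaptive threshold on a 2D list of 0..255 ints.

    Each pixel is compared with the mean of its block x block
    neighbourhood minus c. Windows are clipped at the borders
    (an assumption of this sketch).
    """
    h, w = len(gray), len(gray[0])
    r = block // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [gray[j][i]
                      for j in range(max(0, y - r), min(h, y + r + 1))
                      for i in range(max(0, x - r), min(w, x + r + 1))]
            threshold = sum(window) / len(window) - c
            above = gray[y][x] > threshold
            # binary: 255 where above; inverse binary: 255 where not above
            row.append(255 if above != invert else 0)
        out.append(row)
    return out


def bitwise_not(img):
    """Invert an 8-bit image stored as a 2D list."""
    return [[255 - p for p in row] for row in img]


# Hypothetical sample: a bright stroke (200) on a darker strip (60).
strip = [
    [60, 60, 60, 60, 60],
    [60, 60, 200, 60, 60],
    [60, 60, 200, 60, 60],
    [60, 60, 60, 60, 60],
]

inverse = adaptive_mean_threshold(strip, invert=True)
restored = bitwise_not(inverse)
direct = adaptive_mean_threshold(strip, invert=False)
print(restored == direct)
```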

Now we need to read the image column by column.

To read by column, we set page-segmentation-mode (psm) 4:

"4: Assume a single column of text of variable sizes."
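If you want to compare modes side by side, a small helper like the following can run the same image through several psm values. This is only a sketch: `compare_psm_modes` is a hypothetical name, and the OCR function is passed in as a parameter so the helper can be tried without Tesseract installed. The `fake_ocr` stand-in and its output are invented for the demonstration.

```python
def compare_psm_modes(image, ocr, modes=range(1, 14)):
    """Call ocr(image, config="--psm N") for each mode and collect the text."""
    results = {}
    for psm in modes:
        try:
            results[psm] = ocr(image, config=f"--psm {psm}")
        except Exception as exc:
            # Keep going if one mode raises, and record the error instead
            results[psm] = f"<error: {exc}>"
    return results


# Stand-in OCR function for demonstration only (hypothetical output).
def fake_ocr(image, config=""):
    return f"text read with {config}"


results = compare_psm_modes("placeholder image", fake_ocr, modes=[4, 6])
for psm, text in results.items():
    print(psm, repr(text))
```

With the real library, the call would look like `compare_psm_modes(bnt, pytesseract.image_to_string)`, where `bnt` is the preprocessed image.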

Now when we read the image:

txt = pytesseract.image_to_string(bnt, config="--psm 4")

Result:

WEEK © +4 hours te complete

5 Software

in the fifth week af this course, we'll learn about tcomputer software. We'll learn about what software actually is and the
.
.
.

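We get a lot of text back, but we only need the values 5 and 6.

The logic is: if the string WEEK appears in the current line, take the next non-empty line and print it: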

txt = txt.strip().split("\n")
get_nxt_ln = False
for t in txt:
    if t and get_nxt_ln:
        print(t)
        get_nxt_ln = False
    if "WEEK" in t:
        get_nxt_ln = True

Result:

5 Software
: 6 Troubleshooting

Now, to keep only the integer, we can use a regular expression:

t = re.sub("[^0-9]", "", t)
print(t)

Result:

5
6
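The two steps above can also be wrapped in one function and tried on plain text, without Tesseract. The sketch below makes a few assumptions: `week_numbers` is a hypothetical name, the sample string reuses lines from the OCR output shown earlier, and the second WEEK header line is invented because that part of the output is elided above. Unlike the `re.sub` version, it takes the first run of digits on the line with `re.search`, so a line such as "5 Software 2" would give 5 rather than 52.

```python
import re


def week_numbers(ocr_text):
    """Return the first integer on the first non-empty line after each WEEK line."""
    numbers = []
    take_next = False
    for line in ocr_text.splitlines():
        if take_next and line.strip():
            match = re.search(r"\d+", line)
            if match:
                numbers.append(int(match.group()))
            take_next = False
        if "WEEK" in line:
            take_next = True
    return numbers


# Sample text based on the OCR output above; the bare "WEEK" line
# before week 6 is an assumption made for this example.
sample = (
    "WEEK © +4 hours te complete\n"
    "\n"
    "5 Software\n"
    "\n"
    "in the fifth week af this course, we'll learn about tcomputer software.\n"
    "WEEK\n"
    ": 6 Troubleshooting\n"
)

print(week_numbers(sample))
```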

Code:

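import re
import cv2
import pytesseract

# Load the image and convert it to grayscale
img = cv2.imread("BWSFU.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Adaptive mean threshold (block size 11, constant 2), then invert the result
thr = cv2.adaptiveThreshold(gry, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                            cv2.THRESH_BINARY_INV, 11, 2)
bnt = cv2.bitwise_not(thr)
# OCR with psm 4 (single column of text of variable sizes)
txt = pytesseract.image_to_string(bnt, config="--psm 4")
txt = txt.strip().split("\n")
# Print the digits of the first non-empty line after each line containing "WEEK"
get_nxt_ln = False
for t in txt:
    if t and get_nxt_ln:
        t = re.sub("[^0-9]", "", t)
        print(t)
        get_nxt_ln = False
    if "WEEK" in t:
        get_nxt_ln = True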

python

ocr

python-tesseract