Pytesseract (Tesseract OCR) not picking up some numbers

Question

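I've been developing a program that uses optical character recognition to read financial statements, but I can't figure out why the open-source module I'm using still fails to read certain numbers.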

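I created an output file that draws green boxes on the original input wherever text was detected. In this case, the row containing "381" was picked up, but the row below it (which has exactly the same formatting) was ignored.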

I'm using the code below to preprocess the image before extracting data. The error rate used to be as high as 20% and is now close to 5%.

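import cv2
import numpy as np

img = cv2.imread(filename)
# Upscale by 20% so the glyphs are larger
img = cv2.resize(img, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_CUBIC)
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Note: with a 1x1 kernel, dilate and erode leave the image unchanged
kernel = np.ones((1, 1), np.uint8)
img = cv2.dilate(img, kernel, iterations=1)
img = cv2.erode(img, kernel, iterations=1)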

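After this preprocessing, I also run an algorithm that removes solid lines above a certain size from the document. In this case, though, neither "35" nor "381" is underlined in the original file, so I doubt that is what causes the problem. I have also verified that the line-detection algorithm is not removing the top of the 5.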

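I'm not an expert in OCR or CV; my background is more in data and general programming. I really just need this library to do what it advertises so that I can move on and finish the program. Does anyone know what might be causing this?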

Answer 1

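I would suggest setting your configuration to a specific page segmentation mode (PSM), such as 11, since you are looking for sparse text. For example, my code is: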

results = pytesseract.image_to_data(Image.open(tempFile), lang='eng', config='--psm 11', output_type=Output.DICT)

The PSM options are as follows:

  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
                        bypassing hacks that are Tesseract-specific.
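
One way to check whether the segmentation mode is what hides the "35" row is to run the same image through several modes and compare what comes back. The following is only a rough sketch under stated assumptions: `words_with_boxes` and `compare_psm` are hypothetical helper names, the confidence cutoff of 0 is an arbitrary choice, and the OCR call needs a local Tesseract install.

```python
def words_with_boxes(data, min_conf=0.0):
    """Keep non-empty words from an image_to_data dict, with their boxes."""
    found = []
    for i, text in enumerate(data["text"]):
        # conf may arrive as int, float or str depending on the pytesseract
        # version; -1 marks layout-only rows with no recognised word.
        conf = float(data["conf"][i])
        if text.strip() and conf >= min_conf:
            box = (data["left"][i], data["top"][i],
                   data["width"][i], data["height"][i])
            found.append((text, conf, box))
    return found


def compare_psm(image, modes=(3, 4, 6, 11)):
    """Run OCR once per page segmentation mode and return {mode: [words]}."""
    # Imported lazily so words_with_boxes can be used without Tesseract.
    import pytesseract
    from pytesseract import Output

    results = {}
    for psm in modes:
        data = pytesseract.image_to_data(
            image, lang="eng", config=f"--psm {psm}", output_type=Output.DICT
        )
        results[psm] = [text for text, _, _ in words_with_boxes(data)]
    return results
```

If "35" shows up under one mode but not another, the problem is likely layout analysis rather than the preprocessing.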

There is also a way to search for digits only rather than text in general, which may help as well.
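
A digits-only search is usually done by restricting the characters Tesseract is allowed to output. The sketch below is something to try rather than a guaranteed fix: `tessedit_char_whitelist` is a real Tesseract configuration variable, but it is reportedly ignored by the LSTM engine in Tesseract 4.0 and honored again from 4.1. The names `numeric_config` and `read_numbers` are hypothetical, and the extra characters `.,()-` are an assumption about how figures are written in financial statements.

```python
def numeric_config(psm=11, allowed="0123456789.,()-"):
    """Build a Tesseract config string that limits output to numeric characters."""
    return f"--psm {psm} -c tessedit_char_whitelist={allowed}"


def read_numbers(image, psm=11):
    """OCR an image (PIL image or NumPy array), keeping only numeric characters."""
    import pytesseract  # lazy import: needs the Tesseract binary at call time

    return pytesseract.image_to_string(image, config=numeric_config(psm))
```

The same config string can be passed to `image_to_data` in place of `'--psm 11'` if bounding boxes are needed.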

