How to improve OCR accuracy?

Question

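I have two images, shown below. Tesseract reads A.png perfectly, but its accuracy on B.png is very poor, even though B.png looks similar to A.png. How can I improve the accuracy? I don't know where to start debugging.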

A.png

B.png

Running OCR

# tesseract -v
tesseract 4.1.1-rc2-22-g08899

# tesseract A.png stdout -l jpn --psm 6
Warning: Invalid resolution 0 dpi. Using 70 instead.
第 3 期 決算 公告 令 和 2 年 2 月 7 日
大 阪 市 中 央 区 南 新町 一 丁目 3 番 10 号
株 式 会 社 Link_Mobile

代表 取締 役 佐々 木 勉

貸借 対照 表 の 要旨 (平成 31 年 3 月 31 日 現在 }

# tesseract B.png stdout -l jpn --psm 6
Warning: Invalid resolution 0 dpi. Using 70 instead.
。 人 加計
区 三 6 番 12 号
中 野 駅 前 ビル 5 | 、
am 人 mw
に て
貸借 対照 表 の 要旨 ( 令 和 元 年 11 月 30 日 現在 }

Update 1

Were both scanned using the same scanner, and at the same resolution?

Yes. Both images were cropped from the same PDF.

Are you taking advantage of any APIs which Tesseract exposes for pre-processing the images before doing OCR?

No. I wasn't aware of them. I'm looking into it now.

Answer 1

It improved. I read the Tesseract documentation and rescaled the image.

Rescaling: Tesseract works best on images which have a DPI of at least 300 dpi, so it may be beneficial to resize images. For more information see the FAQ.

Rescaled image

Running OCR

# tesseract B2.png stdout -l jpn --psm 6
第 54 期 決 算 公 告 _ 令 和 2 年 1 月 29 日
東京 都 中 野 区 中 野 三 丁目 36 番 12 号
中 野 駅 前 ビル 5 F
株 式 会 社 コ ー エ ー テ クニ カ
代表 取締 役 小 空 _ 修
貸借 対照 表 の 要旨 ( 令 和 元 年 11 月 30 日 現在 )
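
For anyone who wants to script the rescaling step, here is a minimal sketch in Python. It assumes Pillow is installed and that you have some estimate of the source image's DPI. The answer above does not say which tool or scale factor was used, so the helper names (`scale_factor`, `rescale_image`, `run_tesseract`) and the source DPI in the usage comment are hypothetical. The grayscale conversion is also an addition, not something the answer mentions.

```python
import subprocess

TARGET_DPI = 300  # minimum DPI suggested by the Tesseract documentation quoted above


def scale_factor(source_dpi, target_dpi=TARGET_DPI):
    """Return the factor needed to bring an image from source_dpi up to target_dpi."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    # Never shrink: images already at or above the target are left at 1.0.
    return max(1.0, target_dpi / source_dpi)


def rescale_image(src, dst, source_dpi):
    """Upscale src, write it to dst, and record the new DPI in the file metadata."""
    from PIL import Image  # third-party dependency: pip install Pillow

    factor = scale_factor(source_dpi)
    with Image.open(src) as img:
        # Convert to grayscale so that Lanczos resampling is actually applied
        # (Pillow falls back to nearest-neighbour for "1" and "P" mode images).
        gray = img.convert("L")
        new_size = (round(gray.width * factor), round(gray.height * factor))
        resized = gray.resize(new_size, Image.LANCZOS)
        out_dpi = round(source_dpi * factor)
        resized.save(dst, dpi=(out_dpi, out_dpi))


def run_tesseract(path, lang="jpn", psm=6):
    """Run the same tesseract command as in the question and return its text output."""
    result = subprocess.run(
        ["tesseract", path, "stdout", "-l", lang, "--psm", str(psm)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Example usage (150 is a hypothetical source DPI; the real value is not given):
# rescale_image("B.png", "B2.png", source_dpi=150)
# print(run_tesseract("B2.png"))
```

Writing the DPI into the output file should also avoid the "Invalid resolution 0 dpi" warning seen in the question, since Tesseract then has a resolution to read. Whether a given factor helps depends on the image, so it is worth trying a few values.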
