Receipt reading using TesserOCR

Question

My roommates and I are tired of manually splitting the receipt every time we go grocery shopping (especially at Costco), so I want to build a receipt splitter using image recognition.

I am using tesserocr for Python to convert pictures of the receipt into text, then match the text using regex and do the calculations from there. The problem is that Tesseract does a terrible job of converting the image to text. Here is a picture of one of our receipts, and this is the output after calling api.SetImageFile(img) and api.GetUTF8Text():

”Wm-
Belfsville #214j
|0925 Balfimore Rve. (R1.
Belfsvlll B, MD 20705
4P Member 111869052983
E 1952 SNEET8SRTLY 11.79
E 0000165287 CPN/1952 3.80“
E 1952 SNEET&SHTLY 11.79
E 0000165287 CPN/1952 3.80-
87745 ROTISSERIE 4.99 H
1 5597 BLUEBERRIES 6.99
E. 5597 BIUEBERRIES 6.99
E. 979210 CHOC MRNGOS 9.99 H
F‘ 24311 VHR. MUFFIN 7.99
1 1060788 PRETZELCRISP 6.89
87745 ROTISSERIE 4.99 H
- 87745 ROTISSERIE 4.99 H
EZ 71096 RED DEL 7.99
El 1027557 KOREHNNOODLE 8.79
Ei 382861 KS IN CK BST 16.79
[S 91610 FROSTED FLKS 6.79
[3 11422 3 YR CHDR 12_27
[5 46849 SESNDPRKTEND 12.55
SUBTOTRL 13 _
THX 1,33
xu** TOTAL IIIIIIIBEEIHﬂI
xxxxx XXXXXXX4540 CHIP Read

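As you can see, the output is hard to work with. It reads "A" as "H", and sometimes reads "E" as "F" or some other random character. I figure I have two options: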

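Somehow train Tesseract to read receipts better, but I have no prior experience with machine learning. I tried reading Tesseract's training guide, but it has a lot of technical details I am not familiar with. I imagine the actual process is not difficult, since the images I am reading are very specific.
Take multiple pictures of the receipt, use something like Fred's ImageMagick Scripts to run every picture through different filters, pass all the permutations of the pictures through Tesseract, and merge them into one result. The problems are: 1) I am not sure how to do the merging. Using regex would be difficult. 2) I imagine there would still be baseline problems, like reading "A" as "H".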
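To make the merging idea in the second option concrete, here is a minimal sketch of a majority vote across several OCR runs. It is only an illustration under strong assumptions: every run is assumed to yield the same number of lines in the same order, and the names `vote`, `merge_line`, and `merge_ocr_runs` are hypothetical helpers, not part of tesserocr.

```python
from collections import Counter


def vote(readings):
    """Return the most frequent reading; ties go to the one seen first."""
    return Counter(readings).most_common(1)[0][0]


def merge_line(candidates):
    """Merge several readings of the same receipt line.

    If all readings have the same length, vote character by character;
    otherwise fall back to voting on the whole line.
    """
    if len({len(c) for c in candidates}) == 1:
        return "".join(vote(chars) for chars in zip(*candidates))
    return vote(candidates)


def merge_ocr_runs(runs):
    """Merge OCR outputs of the same receipt by line-wise majority vote.

    Assumption: every run has the same number of lines, in the same order.
    """
    split_runs = [run.splitlines() for run in runs]
    if len({len(lines) for lines in split_runs}) != 1:
        raise ValueError("all runs must have the same number of lines")
    return "\n".join(merge_line(lines) for lines in zip(*split_runs))
```

A vote like this can only repair errors that differ between runs. If every run reads "A" as "H", the merged result will still contain "H", which is the baseline problem mentioned in point 2.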

Can anyone help me with either of these two options? Point me to a way to get this done? Or tell me about another approach I could try?
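For context, the regex-matching step described above could look roughly like the sketch below. The pattern is a guess based only on the sample output in this question (an optional one or two character noise prefix, an item number, a name, a price, an optional trailing "-" for coupons, and an optional one-letter flag). `Item`, `ITEM_RE`, `parse_line`, and `parse_receipt` are hypothetical names for illustration.

```python
import re
from decimal import Decimal
from typing import NamedTuple, Optional


class Item(NamedTuple):
    number: str
    name: str
    price: Decimal       # negative for coupon/credit lines
    flag: Optional[str]  # single letter printed after the price, if any


# Hypothetical pattern, inferred from the sample OCR output only.
ITEM_RE = re.compile(
    r"""^(?:\S{1,2}\s+)?          # optional 1-2 character noise prefix
        (?P<number>\d{3,})\s+     # item number
        (?P<name>.+?)\s+          # item name (lazy, so the price anchors the end)
        (?P<price>\d+\.\d{2})     # price with two decimals
        (?P<credit>-)?            # trailing minus marks a coupon or credit
        (?:\s+(?P<flag>[A-Z]))?$  # optional one-letter flag after the price
    """,
    re.VERBOSE,
)


def parse_line(line: str) -> Optional[Item]:
    """Parse one OCR line into an Item, or return None if it does not match."""
    match = ITEM_RE.match(line.strip())
    if match is None:
        return None
    price = Decimal(match["price"])
    if match["credit"]:
        price = -price
    return Item(match["number"], match["name"], price, match["flag"])


def parse_receipt(text: str) -> list:
    """Return the Items found in the OCR text, skipping unparsable lines."""
    parsed = (parse_line(line) for line in text.splitlines())
    return [item for item in parsed if item is not None]
```

On the sample output, a line such as `[3 11422 3 YR CHDR 12_27` does not match because the price itself is garbled, so lines like that are silently dropped and the computed total would be wrong. That is why the OCR quality matters more than the regex.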

Answer 1

If you can use ImageMagick and are on a Unix-like system (Linux, Mac OS X, Windows with Cygwin, or the Windows 10 Unix environment), then you could try my bash shell scripts, textdeskew and textcleaner, at http://www.fmwconcepts.com/imagemagick/index.php. For example:

textdeskew input.jpg deskew.png

[Image: deskew result]

Then

textcleaner -f 25 -o 10 -g -e normalize -s 1 deskew.png deskew_clean.png

[Image: deskew and clean result]

Or, on any OS with ImageMagick, simply use -deskew and -lat:

convert input.jpg -deskew 40% input_deskew.png

[Image: ImageMagick deskew result]

convert input_deskew.png -negate -lat 25x25+10% -negate input_deskew_lat.png

[Image: ImageMagick deskew and lat result]

Or run them together as:

convert input.jpg -deskew 40% -negate -lat 25x25+10% -negate input_deskew_lat.png
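Since the question uses Python, the combined command can also be driven from a script. The following is a minimal sketch, assuming ImageMagick's `convert` executable is on the PATH (ImageMagick 7 installs may expose it as `magick` instead). The helper names `build_convert_command` and `preprocess` are hypothetical, and the default values simply repeat the ones in the command above.

```python
import shutil
import subprocess


def build_convert_command(src, dst, deskew="40%", lat="25x25+10%"):
    """Build the argument list for the deskew + local adaptive threshold step."""
    return [
        "convert", src,
        "-deskew", deskew,
        "-negate", "-lat", lat, "-negate",
        dst,
    ]


def preprocess(src, dst):
    """Run ImageMagick on src and write the cleaned image to dst."""
    # Assumption: the ImageMagick executable is named "convert".
    if shutil.which("convert") is None:
        raise RuntimeError("ImageMagick 'convert' was not found on PATH")
    subprocess.run(build_convert_command(src, dst), check=True)
    return dst
```

Passing the arguments as a list avoids shell quoting problems with the `%` and `+` characters. Calling `preprocess("input.jpg", "input_deskew_lat.png")` would produce the same file as the command line above.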

Do these help your OCR?
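One rough way to check is to OCR both the original and the preprocessed image and compare how many lines end in something that looks like a price. This is only a sketch: the scoring heuristic and the names `price_line_ratio` and `ocr_file` are assumptions for illustration, and `ocr_file` needs the third-party tesserocr package, using the same two calls mentioned in the question.

```python
import re

# Heuristic: a "good" item line ends in a price, optionally followed by
# a minus sign (coupon) and a one-letter flag.
PRICE_AT_END = re.compile(r"\d+\.\d{2}-?(?:\s+[A-Z])?$")


def price_line_ratio(text):
    """Return the fraction of non-empty lines that end in a price."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    hits = sum(1 for ln in lines if PRICE_AT_END.search(ln))
    return hits / len(lines)


def ocr_file(path):
    """OCR an image file with tesserocr (imported lazily; third-party)."""
    from tesserocr import PyTessBaseAPI

    with PyTessBaseAPI() as api:
        api.SetImageFile(path)
        return api.GetUTF8Text()
```

Comparing `price_line_ratio(ocr_file("input.jpg"))` with `price_line_ratio(ocr_file("input_deskew_lat.png"))` gives a crude signal of whether the cleanup helped. It says nothing about whether the item names were read correctly.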

Tags: python, regex, ocr, tesseract, machine-learning