Tweak tesseract for better detection of URLs in image

Question

I have an image that I cannot get tesseract to recognize as text. All of my input text is URLs.

As you can see, the image is about as clear as it can be.

Running tesseract test2.png stdout returns http:II11111111111111111111111111111111111 1111111111111111111.coml, which is close but not correct.

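When the tessedit_char_whitelist parameter is set to htp:/1.com, the string is recognized correctly (but I would like recognition that works for URLs in general, not just this one).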
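For reference, one way to pass the whitelist is the -c option of the tesseract command line, which sets a single config variable. The wrapper below is only a sketch: the function name ocr_with_whitelist is hypothetical, and it assumes a tesseract build that accepts -c VAR=VALUE.

```shell
# Run tesseract on an image with a character whitelist passed through -c.
# Usage: ocr_with_whitelist IMAGE WHITELIST
# ocr_with_whitelist is a hypothetical helper name, not part of tesseract.
ocr_with_whitelist() {
    image=$1
    whitelist=$2
    # Quoting keeps the whole VAR=VALUE pair together as a single argument,
    # even when the whitelist contains characters such as ':' or '/'.
    tesseract "$image" stdout -c "tessedit_char_whitelist=$whitelist"
}
```

With the values from the question this would be called as ocr_with_whitelist test2.png 'htp:/1.com'. A whitelist this narrow only works because it excludes I and l entirely, which is why it does not generalize to arbitrary URLs.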

I also tried passing a patterns file on the command line with tesseract test2.png stdout --user-patterns ./patterns.txt, using these patterns:

\n\*://\n\*
http://\n\*
\n\*.com

This did not help recognition. It still prefers I over /. (More details on the pattern file.)
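The patterns depend on literal backslashes, which are easy to lose when the file is created from a shell. A minimal sketch for writing the same three patterns safely is shown here; the function name write_url_patterns is hypothetical, and the default output path simply mirrors the ./patterns.txt used in the command.

```shell
# Write the three URL patterns from the question to a patterns file.
# Usage: write_url_patterns [OUTPUT_FILE]   (default: ./patterns.txt)
# write_url_patterns is a hypothetical helper name.
write_url_patterns() {
    out=${1:-./patterns.txt}
    # The quoted delimiter ('EOF') stops the shell from interpreting
    # the backslashes, so the patterns are written byte for byte.
    cat > "$out" <<'EOF'
\n\*://\n\*
http://\n\*
\n\*.com
EOF
}
```

After write_url_patterns has run, the file can be passed exactly as in the question: tesseract test2.png stdout --user-patterns ./patterns.txt.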

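I have also tried setting the parameter ok_repeated_ch_non_alphanum_wds to include / (and chs_trailing_punct{1,2} for the trailing /), but that does not seem to work. Specifying --user-words did not help either (the only "word" being http://).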

Is there a way to specify character priority for tesseract?

Version information:

$ tesseract -v
tesseract 3.04.01
 leptonica-1.73
  libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0

Answer 1

You can achieve this by adding the following line to the unicharambigs file:

3 : I I 3 : / / 1

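Extract the unicharambigs file using combine_tessdata -e eng.traineddata eng.unicharambigs
Edit the unicharambigs file, e.g. with nano eng.unicharambigs (make sure to use tabs after the 3s and after the second /).
Overwrite the unicharambigs file inside the traineddata file with the edited version using combine_tessdata -o eng.traineddata eng.unicharambigs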
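The three steps above can be sketched as one shell function. This is a sketch under stated assumptions rather than a tested recipe: the function name patch_url_ambigs and the .bak backup copy are illustrative additions that are not part of the answer, and it assumes combine_tessdata is on PATH and the traineddata file is writable. printf is used so that the fields are separated by real tab characters, following the advice about tabs.

```shell
# Append the ": I I" -> ": / /" rule to the unicharambigs component of a
# traineddata file. patch_url_ambigs is a hypothetical helper name.
# Usage: patch_url_ambigs [TRAINEDDATA]   (default: eng.traineddata)
patch_url_ambigs() {
    traineddata=${1:-eng.traineddata}
    ambigs=${traineddata%.traineddata}.unicharambigs

    # Assumption: keep a backup, because -o rewrites the file in place.
    cp "$traineddata" "$traineddata.bak" || return 1

    # Step 1: extract the unicharambigs component.
    combine_tessdata -e "$traineddata" "$ambigs" || return 1

    # Make sure the extracted file ends with a newline before appending.
    if [ -n "$(tail -c 1 "$ambigs")" ]; then
        printf '\n' >> "$ambigs"
    fi

    # Step 2: append the rule with tab-separated fields.
    printf '%s\t%s\t%s\t%s\t%s\n' 3 ': I I' 3 ': / /' 1 >> "$ambigs" || return 1

    # Step 3: write the edited component back into the traineddata file.
    combine_tessdata -o "$traineddata" "$ambigs"
}
```

Run it from the directory that holds eng.traineddata (or pass the full path), then repeat the tesseract command. If the result is not what you want, the .bak copy restores the original data.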

Output using the modified traineddata file:

$ tesseract test2.png stdout
http://11111111111111111111111111111111111
1111111111111111111.coml
