Issue training Tesseract OCR 4 - Empty shape table

Question

I am trying to train Tesseract 4 on my own images (to read seven-segment multimeter displays).

Note that I am aware of the trained data from Arthur Augusto at https://github.com/arturaugusto/display_ocr, but I need to train Tesseract on my own data.

To train Tesseract, I followed several tutorials (such as https://robipritrznik.medium.com/recognizing-vehicle-license-plates-on-images-using-tesseract-4-ocr-with-custom-trained-models-4ba9861595e7 or https://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/).

But I always run into a problem when I run the shapeclustering command on my own data

(with the example data from https://github.com/tesseract-ocr/tesseract/issues/1174#issuecomment-338448972, everything works fine).

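In fact, when I run the shapeclustering command, it produces this output (screenshot), and afterwards my shape_table is empty and the resulting training is ineffective...

With the example data, it works fine and the shape_table is populated correctly.

I suspect the problem lies in how I generate the box files. Here is my process for creating them: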

I generate the box file with the command

tesseract imageFileName.tif imageFileName batch.nochop makebox

and then edit it with jTessBoxEditor.

So I cannot see what is wrong with my .box/.tif pairs.
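One way to rule out a malformed box file is a quick format check. The sketch below assumes the usual Tesseract box layout of one `<symbol> <left> <bottom> <right> <top> <page>` entry per line; the function name `check_box_lines` is hypothetical and not part of Tesseract or jTessBoxEditor.

```python
def check_box_lines(text):
    """Return (line_number, message) pairs for entries that look malformed.

    Assumes each non-empty line is: symbol left bottom right top page.
    """
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        # Split from the right so the symbol itself is kept as one field.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6 or not parts[0]:
            problems.append((number, "expected 6 space-separated fields"))
            continue
        try:
            left, bottom, right, top, page = (int(p) for p in parts[1:])
        except ValueError:
            problems.append((number, "coordinates and page must be integers"))
            continue
        if left >= right or bottom >= top:
            problems.append((number, "box has zero or negative size"))
        elif min(left, bottom, page) < 0:
            problems.append((number, "negative coordinate or page index"))
    return problems


if __name__ == "__main__":
    sample = "8 10 12 40 60 0\n. 42 12 48 20 0\n1 50 60 40 12 0\nbad line\n"
    for number, message in check_box_lines(sample):
        print(f"line {number}: {message}")
```

A clean result here only shows that the file is well formed; it says nothing about whether the boxes match the glyphs in the image.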

Have a nice day, and thank you for your help.
Adrien

Here is the complete batch script I use for training after generating and editing the box files.

set name=sev7.exp0
set shortName=sev7

echo Run Tesseract for Training.. 
tesseract.exe %name%.tif %name% nobatch box.train 
 
echo Compute the Character Set.. 
unicharset_extractor.exe %name%.box 

shapeclustering -F font_properties -U unicharset -O %shortName%.unicharset %name%.tr
mftraining -F font_properties -U unicharset -O %shortName%.unicharset %name%.tr
echo Clustering.. 
cntraining.exe %name%.tr
echo Rename Files.. 
rename normproto %shortName%.normproto 
rename inttemp %shortName%.inttemp 
rename pffmtable %shortName%.pffmtable 
rename shapetable %shortName%.shapetable
echo Create Tessdata.. 
combine_tessdata.exe %shortName%.
echo. & pause

Answer 1

OK, so I finally managed to train Tesseract.

The solution was to add a --psm argument, changing the command

tesseract.exe %name%.tif %name% nobatch box.train

to

tesseract.exe %name%.%typeFile% %name%  --psm %psm% nobatch box.train

For reference, the possible psm values are:

REM pagesegmode values are:

REM   0 = Orientation and script detection (OSD) only.
REM   1 = Automatic page segmentation with OSD.
REM   2 = Automatic page segmentation, but no OSD, or OCR
REM   3 = Fully automatic page segmentation, but no OSD. (Default)
REM   4 = Assume a single column of text of variable sizes.
REM   5 = Assume a single uniform block of vertically aligned text.
REM   6 = Assume a single uniform block of text.
REM   7 = Treat the image as a single text line.
REM   8 = Treat the image as a single word.
REM   9 = Treat the image as a single word in a circle.
REM   10 = Treat the image as a single character.
REM   11 = Sparse text. Find as much text as possible in no particular order.
REM   12 = Sparse text with OSD.
REM   13 = Raw line. Treat the image as a single text line bypassing hacks that are Tesseract-specific.

Found at https://github.com/tesseract-ocr/tesseract/issues/434
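For anyone scripting this step outside a batch file, the corrected command can be assembled as below. This is only a sketch: the helper name `box_train_command` is hypothetical, and the value 7 in the usage example is an illustration, since the answer does not say which psm value was actually used.

```python
PSM_MIN, PSM_MAX = 0, 13  # range of the pagesegmode values listed above


def box_train_command(image, base, psm, exe="tesseract"):
    """Build the argument list for the box.train step with an explicit --psm."""
    if not PSM_MIN <= psm <= PSM_MAX:
        raise ValueError(f"psm must be between {PSM_MIN} and {PSM_MAX}, got {psm}")
    # Same argument order as the corrected command in the answer.
    return [exe, image, base, "--psm", str(psm), "nobatch", "box.train"]


if __name__ == "__main__":
    # Print the command instead of running it, so no Tesseract install is needed.
    print(" ".join(box_train_command("sev7.exp0.tif", "sev7.exp0", 7)))
```

The returned list can be passed to `subprocess.run` as is, which avoids shell quoting problems with file names.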

ocr

tesseract