Tesseract: Advantage to Multi-Page Training File vs. Multiple Separate Files?

This SO post suggests that training Tesseract with .tif files has an advantage over .png files, because a .tif file can hold multiple pages and therefore a larger training sample. Yet this SO question discusses training with multiple images at once. More importantly, the man pages, for example the one for mftraining, indicate that the tools accept multiple training files.

Is there any reason not to train with multiple separate image files?

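You can certainly train with multiple image files; Tesseract will treat them as having different, independent fonts. There is also a limit on the number of images (64). If the images use the same font, it is better to put them in a single multi-page TIFF. According to its specification, a TIFF file can act as a container for many images.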

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract https://en.wikipedia.org/wiki/Tagged_Image_File_Format
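To act on the multi-page TIFF suggestion, the per-page images could be merged with ImageMagick. The following is only a minimal sketch and not part of the answer above: the function name `merge_pages`, the `CONVERT` override variable, and the file names are hypothetical, and it assumes ImageMagick's `convert` is on the PATH.

```shell
#!/bin/sh
# Sketch: merge several single-page images into one multi-page TIFF.
# Assumption: ImageMagick's `convert` is installed. CONVERT is a hypothetical
# override, e.g. CONVERT=echo prints the arguments instead of converting.

# Usage: merge_pages OUTPUT.tif PAGE1 PAGE2 ...
merge_pages() {
    if [ "$#" -lt 2 ]; then
        echo "usage: merge_pages OUTPUT.tif PAGE..." >&2
        return 1
    fi
    out=$1
    shift
    # Given several input images and a .tif output name, convert writes
    # each input as one page of the output file.
    "${CONVERT:-convert}" "$@" "$out"
}
```

For example, `merge_pages eng1.MyNewFont.exp0.tif Page1.png Page2.png` would write both pages into one file, and `identify eng1.MyNewFont.exp0.tif` should then list one entry per page.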

It appears that training Tesseract on a single font using multiple images works just fine. Here is a sketch of the workflow I used:

# Convert the .pdf pages to .png images
convert -density 600 Page1.pdf eng1.MyNewFont.exp1.png
convert -density 600 Page2.pdf eng1.MyNewFont.exp2.png

# Create .box files
tesseract eng1.MyNewFont.exp1.png eng1.MyNewFont.exp1 -l eng batch.nochop makebox
tesseract eng1.MyNewFont.exp2.png eng1.MyNewFont.exp2 -l eng batch.nochop makebox

## correct boxes with jTessBoxEditor or another box editor ##

# Create two new box.tr files: eng1.MyNewFont.exp1.box.tr and eng1.MyNewFont.exp2.box.tr

tesseract eng1.MyNewFont.exp1.png eng1.MyNewFont.exp1.box -l eng1 nobatch box.train.stderr
tesseract eng1.MyNewFont.exp2.png eng1.MyNewFont.exp2.box -l eng1 nobatch box.train.stderr

# Extract characters from the two .box files
unicharset_extractor eng1.MyNewFont.exp1.box eng1.MyNewFont.exp2.box 

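# font_properties line format: <fontname> <italic> <bold> <fixed> <serif> <fraktur> (each flag 0 or 1)
echo "MyNewFont 0 0 0 0 0" >> font_properties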

# train using the two new box.tr files.
mftraining -F font_properties -U unicharset -O eng1.unicharset eng1.MyNewFont.exp1.box.tr eng1.MyNewFont.exp2.box.tr 
cntraining eng1.MyNewFont.exp1.box.tr eng1.MyNewFont.exp2.box.tr

## rename files
mv inttemp  eng1.inttemp
mv normproto  eng1.normproto
mv pffmtable  eng1.pffmtable
mv shapetable  eng1.shapetable

combine_tessdata eng1. ## create .traineddata file.
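With more than two pages, typing every .box.tr name by hand gets tedious. As a rough sketch, the file list could be generated instead; the helper name `tr_files` and its arguments are hypothetical, and it assumes the pages are numbered exp1 through expN as in the workflow above.

```shell
#!/bin/sh
# Sketch: build the list of .box.tr files for pages exp1..expN so the
# mftraining and cntraining calls need not be edited for every new page.
# Hypothetical helper; naming follows the LANG.FONT.expN.box.tr pattern
# used in the workflow above.

# Usage: tr_files LANG FONT COUNT
tr_files() {
    lang=$1
    font=$2
    count=$3
    i=1
    list=""
    while [ "$i" -le "$count" ]; do
        # Append the next name, separated by a single space.
        list="$list${list:+ }$lang.$font.exp$i.box.tr"
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}
```

The result is meant to be used unquoted so that the shell splits it into separate arguments, for example `cntraining $(tr_files eng1 MyNewFont 2)`, and likewise at the end of the `mftraining` command line. The limit of 64 images mentioned above would still apply.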