How to generate lstmf from .box and .tif files in Tesseract 5 alpha LSTM training

Question

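I am using the current alpha version of Tesseract 5. I am trying to train it from images, without font files. I managed to generate a box file from an image with the following command: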

tesseract image.tif imagebox -l ara wordstrbox

After this step I correct the OCR errors in the box file. What I need next is to convert the box file and the tif file into a .lstmf file.

I can't find any guidance on how to do this. All I found is this passage in the OCR training documentation:

The training data is provided via .lstmf files, which are serialized DocumentData. They contain an image and the corresponding UTF8 text transcription, and can be generated from tif/box file pairs using Tesseract in a similar manner to the way .tr files were created for the old engine.

How can I convert the tif and box files to lstmf at this stage? Please advise.

Thanks,

Answer 1

Found it:

tesseract image.tif training --psm 6 lstm.train

Note that the box file must have the same base name as the image file (for example, image.tif and image.box).
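Because the command in the question writes imagebox.box, that file would first need to be renamed to image.box to sit next to image.tif. For more than one image, the same step can be looped. The following is only a minimal sketch, assuming all .tif/.box pairs live in one directory; the function name make_lstmf and the TESSERACT variable are made up for illustration and are not part of Tesseract.

```shell
#!/bin/sh
# Sketch: build one .lstmf per image/box pair in a directory.
# TESSERACT and make_lstmf are illustrative names, not part of Tesseract.
TESSERACT="${TESSERACT:-tesseract}"

make_lstmf() {
    dir="$1"
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue      # the glob matched no .tif files
        base="${img%.tif}"             # strip the extension: image.tif -> image
        if [ ! -f "$base.box" ]; then
            # The box file must share the image's base name.
            echo "skipping $img: no $base.box" >&2
            continue
        fi
        # Same invocation as in the answer; the output base is the image's
        # base name, so the result is written to "$base.lstmf".
        "$TESSERACT" "$img" "$base" --psm 6 lstm.train
    done
}
```

Calling make_lstmf with the directory that holds the pairs (for example, make_lstmf ./ground-truth, where the directory name is hypothetical) should leave one .lstmf file beside each image. Images without a matching box file are skipped with a message on stderr.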
