Tesseract/Leptonica proper way to handle single and multipage images?

Question

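I have a few questions about how Tesseract (using Leptonica) handles input images. What I am trying to do is find a way to process any image file (without requiring a specific format) and feed it to the Tesseract API later, but this does not seem to be the right way to do things with Leptonica...

Here is an example of what I am doing: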

string tmpFile ="path/to/my/file";
// Trying to load a PIXA struct, since it can handle multipage images
PIXA* sourceImg =pixaRead(tmpFile.c_str());
if (sourceImg == NULL) {
    // This happens when pixaRead fails to load the image,
    // so we assume it is a single-page image file.
    // Build the PIXA through the Leptonica API so that pixaDestroy can free it.
    sourceImg = pixaCreate(1);
    assert(sourceImg != NULL);
    Pix* singlePage = pixRead(tmpFile.c_str());
    assert(singlePage != NULL);
    pixaAddPix(sourceImg, singlePage, L_INSERT);  // the PIXA takes ownership
}
api = new tesseract::TessBaseAPI();
if (api->Init(NULL, "eng")) {
    fprintf(stderr, "Could not initialize tesseract.\n");
    exit(1);
}
// Now we can process each page
for(int i=0; i<sourceImg->n; i++) {
    // results is an object I use to save the text from each document,
    // with a page count
    if (i > 0)
        results.addPage();
    Pix* image =sourceImg->pix[i];
    api->SetImage(image);
    // Get OCR result
    outText = api->GetUTF8Text();

    // Here I process stuff, not really important    
    int dummyPos=0;
    results.addLine(outText, dummyPos, dummyPos, dummyPos, dummyPos);
    delete [] outText;
}
pixaDestroy(&sourceImg);
api->End();

So this works, but not the way I want, because even when I use a multipage TIFF I get the following messages while loading the image:

Error in pixaReadStream: not a pixa file
Error in pixaRead: pixa not read

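It is still able to process the document, thanks to the pixRead call I fall back to when pixaRead fails...

Can someone explain what is wrong with my use of the pixaRead function? Is it possible to handle single-page and multipage images in a similar way?

PS: I am using Tesseract v4.0 and Leptonica v1.74.4.

Thanks in advance!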

Answer 1

Use pixaReadMultipageTiff to read TIFF images (single-page or multipage), and pixRead to read other image formats.
