How to detect text blocks and columns in a PDF with tess4j

Question

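I'm new to Tesseract (tess4j) and have managed to use the main features, such as reading text from an image or PDF and getting word positions, rotation, and so on.

What I couldn't find, and am not sure is easily possible, is a way to detect text blocks (paragraphs or columns). Also, if the PDF contains other kinds of blocks, such as images, is there some way to get them, or at least to get the position (bounding box) of each block?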

Answer 1

You can use the TessBaseAPIGetComponentImages API method, as follows:

Boxa boxes = api.TessBaseAPIGetComponentImages(handle, TessPageIteratorLevel.RIL_BLOCK, TRUE, null, null);

See the Tess4J unit tests for a complete example.
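
As for columns: as far as I know there is no dedicated "column" level in TessPageIteratorLevel, so one option is to group the RIL_BLOCK boxes yourself. The following is only a minimal sketch of that idea, assuming the boxes have already been copied into java.awt.Rectangle objects. ColumnGrouper and groupIntoColumns are hypothetical names, not part of tess4j or Tesseract.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper (not part of tess4j): groups block rectangles into columns
// by merging blocks whose horizontal extents overlap.
final class ColumnGrouper {

    // Returns columns ordered left to right; blocks in each column are ordered top to bottom.
    static List<List<Rectangle>> groupIntoColumns(List<Rectangle> blocks) {
        List<Rectangle> sorted = new ArrayList<>(blocks);
        sorted.sort(Comparator.comparingInt(r -> r.x));

        List<List<Rectangle>> columns = new ArrayList<>();
        int columnRight = Integer.MIN_VALUE; // right edge of the column being built
        for (Rectangle r : sorted) {
            if (columns.isEmpty() || r.x >= columnRight) {
                // No horizontal overlap with the current column: start a new one.
                columns.add(new ArrayList<>());
                columnRight = r.x + r.width;
            } else {
                // Overlaps the current column: extend its right edge if needed.
                columnRight = Math.max(columnRight, r.x + r.width);
            }
            columns.get(columns.size() - 1).add(r);
        }
        for (List<Rectangle> column : columns) {
            column.sort(Comparator.comparingInt(r -> r.y));
        }
        return columns;
    }
}
```

Note that a full-width block (a page header, for example) overlaps every column and would merge them all into one, so you would probably want to filter such blocks out before grouping.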

Answer 2

I have already accepted the answer, but here is what I ended up with based on it:

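// Assumed imports: com.sun.jna.Pointer, com.sun.jna.ptr.PointerByReference,
// net.sourceforge.lept4j.*, net.sourceforge.tess4j.ITessAPI.*,
// static net.sourceforge.tess4j.ITessAPI.FALSE, static net.sourceforge.lept4j.ILeptonica.L_CLONE.
// api, handle, datapath, language and log are fields of the enclosing class.
public Page recognizeTextBlocks(Path path) {
    log.info("TessBaseAPIGetComponentImages");
    File image = new File(path.toString());
    Leptonica leptInstance = Leptonica.INSTANCE;
    Pix pix = leptInstance.pixRead(image.getPath());
    Page blocks = new Page(pix.w, pix.h);
    api.TessBaseAPIInit3(handle, datapath, language);
    api.TessBaseAPISetImage2(handle, pix);
    api.TessBaseAPISetPageSegMode(handle, TessPageSegMode.PSM_AUTO_OSD);

    // Ask for block-level components; FALSE means non-text regions are included too.
    Boxa boxes = api.TessBaseAPIGetComponentImages(handle, TessPageIteratorLevel.RIL_BLOCK, FALSE, null, null);
    int boxCount = (boxes == null) ? 0 : leptInstance.boxaGetCount(boxes);
    for (int i = 0; i < boxCount; i++) {
        Box box = leptInstance.boxaGetBox(boxes, i, L_CLONE);
        if (box == null) {
            continue;
        }
        // Restrict recognition to this block and read its text.
        api.TessBaseAPISetRectangle(handle, box.x, box.y, box.w, box.h);
        Pointer utf8Text = api.TessBaseAPIGetUTF8Text(handle);
        String ocrResult = (utf8Text == null) ? null : utf8Text.getString(0);
        int conf = api.TessBaseAPIMeanTextConf(handle);
        if (utf8Text != null) {
            api.TessDeleteText(utf8Text);
        }

        // A block with no non-blank text is treated as an image block.
        Rectangle bounds = new Rectangle(box.x, box.y, box.w, box.h);
        Block block;
        if (ocrResult == null || ocrResult.replace("\n", "").replace(" ", "").isEmpty()) {
            block = new ImageBlock(bounds);
        } else {
            block = new TextBlock(bounds, ocrResult);
        }
        blocks.add(block);
        log.debug(String.format("Box[%d]: x=%d, y=%d, w=%d, h=%d, confidence: %d, text: %s", i, box.x, box.y, box.w, box.h, conf, ocrResult));

        // Release the cloned Box.
        PointerByReference boxRef = new PointerByReference();
        boxRef.setValue(box.getPointer());
        leptInstance.boxDestroy(boxRef);
    }

    // Release the Boxa and Pix resources.
    if (boxes != null) {
        PointerByReference boxaRef = new PointerByReference();
        boxaRef.setValue(boxes.getPointer());
        leptInstance.boxaDestroy(boxaRef);
    }
    PointerByReference pRef = new PointerByReference();
    pRef.setValue(pix.getPointer());
    leptInstance.pixDestroy(pRef);

    return blocks;
}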

Note: the classes Page, Block, ImageBlock and TextBlock come from my own project; they are not part of tess4j or Tesseract.
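
For anyone who wants to try the method without those project classes, here is a rough sketch of what they might look like. It is only a guess that mirrors how the classes are used above, not the actual code from the project. The field names and the classify helper are hypothetical.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the project-specific classes; only the class names
// and constructor shapes are taken from the method above.
abstract class Block {
    final Rectangle bounds;
    Block(Rectangle bounds) { this.bounds = bounds; }
}

final class TextBlock extends Block {
    final String text;
    TextBlock(Rectangle bounds, String text) { super(bounds); this.text = text; }
}

final class ImageBlock extends Block {
    ImageBlock(Rectangle bounds) { super(bounds); }
}

final class Page {
    final int width;
    final int height;
    private final List<Block> blocks = new ArrayList<>();

    Page(int width, int height) { this.width = width; this.height = height; }

    void add(Block block) { blocks.add(block); }

    List<Block> getBlocks() { return blocks; }

    // Similar rule to the method above: a region whose OCR text is null or
    // whitespace-only is treated as an image block, anything else as text.
    static Block classify(Rectangle bounds, String ocrText) {
        if (ocrText == null || ocrText.trim().isEmpty()) {
            return new ImageBlock(bounds);
        }
        return new TextBlock(bounds, ocrText);
    }
}
```

Keep in mind that "no recognized text" is only a heuristic for "image": an empty region or a region with unreadable text would be classified as an ImageBlock as well.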
