Extracting text from a rectangle using iText (.NET) gives me the entire line

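Below is the code I use to extract text from a PDF (using iText for .NET, version 7.0.4.0). In testing, it works well for most PDFs and extracts only the content inside the rectangle. For a few of them, however, it returns the entire line from the PDF. I know that iText returns the text snippets that intersect with the rectangle (so part of the text may lie outside the rectangle, since iText doesn't cut text snippets into pieces).

But I would like to understand which parameters in the PDF iText uses to split the text.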

        var reader = new PdfReader( filePath );
        PdfDocument pdfDoc = new PdfDocument( reader );

        var addressRect = new Rectangle( 33, 190, 70, 42 ); // x, y, width, height

        var addressRegionFilter = new TextRegionEventFilter( addressRect );
        var filterListener = new FilteredTextEventListener( new LocationTextExtractionStrategy(), addressRegionFilter );
        var addressText = PdfTextExtractor.GetTextFromPage( pdfDoc.GetPage( 1 ), filterListener );

        pdfDoc.Close();

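This should do the trick. The example is written in Java; the iText 7 .NET API mirrors it with PascalCase method names.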

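import com.itextpdf.kernel.geom.Rectangle;
import com.itextpdf.kernel.pdf.canvas.parser.EventType;
import com.itextpdf.kernel.pdf.canvas.parser.data.IEventData;
import com.itextpdf.kernel.pdf.canvas.parser.data.TextRenderInfo;
import com.itextpdf.kernel.pdf.canvas.parser.listener.CharacterRenderInfo;
import com.itextpdf.kernel.pdf.canvas.parser.listener.ITextExtractionStrategy;

import java.util.Set;

class RectangleTextExtractionStrategy implements ITextExtractionStrategy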
{

    private ITextExtractionStrategy innerStrategy = null;
    private Rectangle rectangle;

    public RectangleTextExtractionStrategy(ITextExtractionStrategy strategy, Rectangle rectangle)
    {
        this.innerStrategy = strategy;
        this.rectangle = rectangle;
    }

    @Override
    public String getResultantText() {
        return innerStrategy.getResultantText();
    }

    @Override
    public void eventOccurred(IEventData iEventData, EventType eventType) {
        // Only text rendering events are of interest here.
        if(eventType != EventType.RENDER_TEXT)
            return;
        TextRenderInfo tri = (TextRenderInfo) iEventData;
        // Split the chunk into one TextRenderInfo per character.
        for(TextRenderInfo subTri : tri.getCharacterRenderInfos())
        {
            Rectangle r2 = new CharacterRenderInfo(subTri).getBoundingBox();
            // Forward only the characters that fall inside the search region.
            if(intersects(r2))
               innerStrategy.eventOccurred(subTri, EventType.RENDER_TEXT);
        }
    }

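    private boolean intersects(Rectangle charBox)
    {
        // One possible implementation of the check: a plain axis-aligned
        // overlap test between the character box and the search rectangle.
        return charBox.getLeft() < rectangle.getRight()
            && rectangle.getLeft() < charBox.getRight()
            && charBox.getBottom() < rectangle.getTop()
            && rectangle.getBottom() < charBox.getTop();
    }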

    @Override
    public Set<EventType> getSupportedEvents() {
        return innerStrategy.getSupportedEvents();
    }
}

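The idea here is to split every incoming TextRenderInfo object into separate events for its individual characters. Then (if they are in the search region) we delegate the call to another ITextExtractionStrategy.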
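
To see why per-character filtering changes the result, here is a minimal, library-free sketch of the same idea. It does not use iText: `Box`, `filterChunk`, the fixed character width, and the sample line are all hypothetical stand-ins chosen for illustration. It models one text chunk that spans a whole line and keeps only the characters whose boxes overlap the region.

```java
public class ChunkSplitSketch {

    /** Hypothetical stand-in for a bounding box; this is not an iText type. */
    static final class Box {
        final float left, bottom, right, top;

        Box(float left, float bottom, float width, float height) {
            this.left = left;
            this.bottom = bottom;
            this.right = left + width;
            this.top = bottom + height;
        }

        /** Axis-aligned overlap test; boxes that merely touch do not count. */
        boolean intersects(Box other) {
            return left < other.right && other.left < right
                && bottom < other.top && other.bottom < top;
        }
    }

    /**
     * Walks one chunk character by character and keeps the characters whose
     * box overlaps the region. Assumes a fixed width per character, which
     * real fonts generally do not have.
     */
    static String filterChunk(String chunk, float startX, float baseline,
                              float charWidth, float charHeight, Box region) {
        StringBuilder kept = new StringBuilder();
        for (int i = 0; i < chunk.length(); i++) {
            Box charBox = new Box(startX + i * charWidth, baseline, charWidth, charHeight);
            if (charBox.intersects(region)) {
                kept.append(chunk.charAt(i));
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        // Same numbers as the rectangle in the question: x, y, width, height.
        Box region = new Box(33, 190, 70, 42);
        // A single chunk covering the whole line, as in the problematic PDFs.
        String line = "John Smith      Invoice 0042";
        System.out.println("[" + filterChunk(line, 33, 200, 6, 10, region).trim() + "]");
    }
}
```

Filtering at chunk level would keep the whole line, because the chunk as a whole overlaps the region. Filtering per character drops everything to the right of the rectangle, which is what the wrapper strategy above does with real character boxes.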