从具有不同高度的 table 行中提取 pdf 文本（java 使用 pdfbox 库）

Question

黑色形状是需要提取的文本：

到目前为止，我已经从列中提取了文本，但是是手动提取的，因为只有 5 个（使用矩形 class 作为区域）。我的问题是：有没有办法对行执行此操作，因为矩形的大小（高度）不同并且手动将其执行到 50 多行将是一种暴行？更具体地说，我可以使用函数根据每一行的高度更改矩形吗？或者任何可能有帮助的建议？

Answer 1

如评论中所建议，您可以通过解析页面的矢量图形指令自动识别示例 PDF 的 table 单元格区域。

对于这样的任务，您可以扩展 PDFBox PDFGraphicsStreamEngine，它提供了调用路径构建和绘图指令的抽象方法。

注意：我在这里展示的流引擎 class 专门用于识别 table 单元格框架线绘制为示例文档中使用的黑色填充的长小矩形。对于一般解决方案，您至少还应该识别绘制为矢量图形线段或描边矩形的框架线。

流引擎class`PdfBoxFinder`

这个流引擎class收集水平线的y坐标范围和垂直线的x坐标范围和然后提供由这些坐标范围定义的网格框。特别是这意味着不支持行跨度或列跨度；在手头的情况下，这没关系，因为没有这样的跨度。

public class PdfBoxFinder extends PDFGraphicsStreamEngine {
    /**
     * Supply the page to analyze here; to analyze multiple pages
     * create multiple {@link PdfBoxFinder} instances.
     */
    public PdfBoxFinder(PDPage page) {
        super(page);
    }

    /**
     * The boxes ({@link Rectangle2D} instances with coordinates according to
     * the PDF coordinate system, e.g. for decorating the table cells) the
     * {@link PdfBoxFinder} has recognized on the current page.
     */
    public Map<String, Rectangle2D> getBoxes() {
        consolidateLists();
        Map<String, Rectangle2D> result = new HashMap<>();
        if (!horizontalLines.isEmpty() && !verticalLines.isEmpty())
        {
            Interval top = horizontalLines.get(horizontalLines.size() - 1);
            char rowLetter = 'A';
            for (int i = horizontalLines.size() - 2; i >= 0; i--, rowLetter++) {
                Interval bottom = horizontalLines.get(i);
                Interval left = verticalLines.get(0);
                int column = 1;
                for (int j = 1; j < verticalLines.size(); j++, column++) {
                    Interval right = verticalLines.get(j);
                    String name = String.format("%s%s", rowLetter, column);
                    Rectangle2D rectangle = new Rectangle2D.Float(left.from, bottom.from, right.to - left.from, top.to - bottom.from);
                    result.put(name, rectangle);
                    left = right;
                }
                top = bottom;
            }
        }
        return result;
    }

    /**
     * The regions ({@link Rectangle2D} instances with coordinates according
     * to the PDFBox text extraction API, e.g. for initializing the regions of
     * a {@link PDFTextStripperByArea}) the {@link PdfBoxFinder} has recognized
     * on the current page.
     */
    public Map<String, Rectangle2D> getRegions() {
        PDRectangle cropBox = getPage().getCropBox();
        float xOffset = cropBox.getLowerLeftX();
        float yOffset = cropBox.getUpperRightY();
        Map<String, Rectangle2D> result = getBoxes();
        for (Map.Entry<String, Rectangle2D> entry : result.entrySet()) {
            Rectangle2D box = entry.getValue();
            Rectangle2D region = new Rectangle2D.Float(xOffset + (float)box.getX(), yOffset - (float)(box.getY() + box.getHeight()), (float)box.getWidth(), (float)box.getHeight());
            entry.setValue(region);
        }
        return result;
    }

    /**
     * <p>
     * Processes the path elements currently in the {@link #path} list and
     * eventually clears the list.
     * </p>
     * <p>
     * Currently only elements are considered which 
     * </p>
     * <ul>
     * <li>are {@link Rectangle} instances;
     * <li>are filled fairly black;
     * <li>have a thin and long form; and
     * <li>have sides fairly parallel to the coordinate axis.
     * </ul>
     */
    void processPath() throws IOException {
        PDColor color = getGraphicsState().getNonStrokingColor();
        if (!isBlack(color)) {
            logger.debug("Dropped path due to non-black fill-color.");
            return;
        }

        for (PathElement pathElement : path) {
            if (pathElement instanceof Rectangle) {
                Rectangle rectangle = (Rectangle) pathElement;

                double p0p1 = rectangle.p0.distance(rectangle.p1);
                double p1p2 = rectangle.p1.distance(rectangle.p2);
                boolean p0p1small = p0p1 < 3;
                boolean p1p2small = p1p2 < 3;

                if (p0p1small) {
                    if (p1p2small) {
                        logger.debug("Dropped rectangle too small on both sides.");
                    } else {
                        processThinRectangle(rectangle.p0, rectangle.p1, rectangle.p2, rectangle.p3);
                    }
                } else if (p1p2small) {
                    processThinRectangle(rectangle.p1, rectangle.p2, rectangle.p3, rectangle.p0);
                } else {
                    logger.debug("Dropped rectangle too large on both sides.");
                }
            }
        }
        path.clear();
    }

    /**
     * The argument points shall be sorted to have (p0, p1) and (p2, p3) be the small
     * edges and (p1, p2) and (p3, p0) the long ones.
     */
    void processThinRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) {
        float longXDiff = (float)Math.abs(p2.getX() - p1.getX());
        float longYDiff = (float)Math.abs(p2.getY() - p1.getY());
        boolean longXDiffSmall = longXDiff * 10 < longYDiff;
        boolean longYDiffSmall = longYDiff * 10 < longXDiff;

        if (longXDiffSmall) {
            verticalLines.add(new Interval(p0.getX(), p1.getX(), p2.getX(), p3.getX()));
        } else if (longYDiffSmall) {
            horizontalLines.add(new Interval(p0.getY(), p1.getY(), p2.getY(), p3.getY()));
        } else {
            logger.debug("Dropped rectangle too askew.");
        }
    }

    /**
     * Sorts the {@link #horizontalLines} and {@link #verticalLines} lists and
     * merges fairly identical entries.
     */
    void consolidateLists() {
        for (List<Interval> intervals : Arrays.asList(horizontalLines, verticalLines)) {
            intervals.sort(null);
            for (int i = 1; i < intervals.size();) {
                if (intervals.get(i-1).combinableWith(intervals.get(i))) {
                    Interval interval = intervals.get(i-1).combineWith(intervals.get(i));
                    intervals.set(i-1, interval);
                    intervals.remove(i);
                } else {
                    i++;
                }
            }
        }
    }

    /**
     * Checks whether the given color is black'ish.
     */
    boolean isBlack(PDColor color) throws IOException {
        int value = color.toRGB();
        for (int i = 0; i < 2; i++) {
            int component = value & 0xff;
            if (component > 5)
                return false;
            value /= 256;
        }
        return true;
    }

    //
    // PDFGraphicsStreamEngine overrides
    //
    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException {
        path.add(new Rectangle(p0, p1, p2, p3));
    }

    @Override
    public void endPath() throws IOException {
        path.clear();
    }

    @Override
    public void strokePath() throws IOException {
        path.clear();
    }

    @Override
    public void fillPath(int windingRule) throws IOException {
        processPath();
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException {
        processPath();
    }

    @Override public void drawImage(PDImage pdImage) throws IOException { }
    @Override public void clip(int windingRule) throws IOException { }
    @Override public void moveTo(float x, float y) throws IOException { }
    @Override public void lineTo(float x, float y) throws IOException { }
    @Override public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException { }
    @Override public Point2D getCurrentPoint() throws IOException { return null; }
    @Override public void closePath() throws IOException { }
    @Override public void shadingFill(COSName shadingName) throws IOException { }

    //
    // inner classes
    //
    class Interval implements Comparable<Interval> {
        final float from;
        final float to;

        Interval(float... values) {
            Arrays.sort(values);
            this.from = values[0];
            this.to = values[values.length - 1];
        }

        Interval(double... values) {
            Arrays.sort(values);
            this.from = (float) values[0];
            this.to = (float) values[values.length - 1];
        }

        boolean combinableWith(Interval other) {
            if (this.from > other.from)
                return other.combinableWith(this);
            if (this.to < other.from)
                return false;
            float intersectionLength = Math.min(this.to, other.to) - other.from;
            float thisLength = this.to - this.from;
            float otherLength = other.to - other.from;
            return (intersectionLength >= thisLength * .9f) || (intersectionLength >= otherLength * .9f);
        }

        Interval combineWith(Interval other) {
            return new Interval(this.from, this.to, other.from, other.to);
        }

        @Override
        public int compareTo(Interval o) {
            return this.from == o.from ? Float.compare(this.to, o.to) : Float.compare(this.from, o.from);
        }

        @Override
        public String toString() {
            return String.format("[%3.2f, %3.2f]", from, to);
        }
    }

    interface PathElement {
    }

    class Rectangle implements PathElement {
        final Point2D p0, p1, p2, p3;

        Rectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) {
            this.p0 = p0;
            this.p1 = p1;
            this.p2 = p2;
            this.p3 = p3;
        }
    }

    //
    // members
    //
    final List<PathElement> path = new ArrayList<>();
    final List<Interval> horizontalLines = new ArrayList<>();
    final List<Interval> verticalLines = new ArrayList<>();
    final Logger logger = LoggerFactory.getLogger(PdfBoxFinder.class);
}

(PdfBoxFinder.java)

示例使用

您可以像这样使用 PdfBoxFinder 从位于 FILE_PATH:

的示例文档的 table 单元格中提取文本

try (   PDDocument document = PDDocument.load(FILE_PATH) ) {
    for (PDPage page : document.getDocumentCatalog().getPages()) {
        PdfBoxFinder boxFinder = new PdfBoxFinder(page);
        boxFinder.processPage(page);

        PDFTextStripperByArea stripperByArea = new PDFTextStripperByArea();
        for (Map.Entry<String, Rectangle2D> entry : boxFinder.getRegions().entrySet()) {
            stripperByArea.addRegion(entry.getKey(), entry.getValue());
        }

        stripperByArea.extractRegions(page);
        List<String> names = stripperByArea.getRegions();
        names.sort(null);
        for (String name : names) {
            System.out.printf("[%s] %s\n", name, stripperByArea.getTextForRegion(name));
        }
    }
}

(ExtractBoxedText 测试 testExtractBoxedTexts)

输出开始：

[A1] Nr. 
crt. 

[A2] Nume şi prenume 

[A3] Titlul lucrării 

[A4] Coordonator ştiinţific 

[A5] Ora 

[B1] 1. 

[B2] SFETCU I. JESSICA-
LARISA 

[B3] Analiza fluxurilor de date twitter 

[B4] Conf. univ. dr. Frîncu Marc 
Eduard 
 

[B5] 8:00 

[C1] 2. 

[C2] TARBA V. IONUȚ-
ADRIAN 

[C3] Test me - rest api folosind java şi 
play framework 

[C4] Conf.univ.dr. Fortiş Teodor 
Florin 
 

[C5] 8:12

文档第一页：

从具有不同高度的 table 行中提取 pdf 文本（java 使用 pdfbox 库）

Extracting text from pdf (java using pdfbox library) from a table's rows with different heights

java

rectangles

pdfbox

流引擎class`PdfBoxFinder`

示例使用

从具有不同高度的 table 行中提取 pdf 文本（java 使用 pdfbox 库）

Extracting text from pdf (java using pdfbox library) from a table's rows with different heights

java

rectangles

pdfbox

流引擎classPdfBoxFinder

示例使用

流引擎class`PdfBoxFinder`