Using PDFBox to get location of line of text

Question

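I'm using PDFBox to extract information from a pdf, and the information I'm currently trying to find is related to the x-position of the first character in the line. I can't find anything on how to get that information. I know PDFBox has a class called TextPosition, but I can't find out how to get a TextPosition object from the PDDocument either. How do I get the location information of a line of text from a pdf?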

Answer 1

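In general

To extract text with PDFBox (with or without extra information like positions, colors, etc.), you instantiate a PDFTextStripper or a class derived from it and use it like this: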

PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);

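(There are a number of PDFTextStripper properties which allow you to restrict the pages from which text is extracted.)

During the execution of getText, the content streams of the pages in question (and those of form xObjects referenced from those pages) are parsed and the text drawing commands are processed.

If you want to change the text extraction behavior, you have to change this processing of text drawing commands, which you should most often do by overriding this method: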

/**
 * Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
 * and just calls {@link #writeString(String)}.
 *
 * @param text The text to write to the stream.
 * @param textPositions The TextPositions belonging to the text.
 * @throws IOException If there is an error when writing the text.
 */
protected void writeString(String text, List<TextPosition> textPositions) throws IOException
{
    writeString(text);
}

If you additionally need to know when a new line starts, you may also want to override

/**
 * Write the line separator value to the output stream.
 * @throws IOException
 *             If there is a problem writing out the lineseparator to the document.
 */
protected void writeLineSeparator( ) throws IOException
{
    output.write(getLineSeparator());
}

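writeString can be overridden to channel the text information into separate members (e.g. if you want the result in a more structured format than a mere String), or it can be overridden to simply add some extra information to the result String.

writeLineSeparator can be overridden to trigger some specific output between lines.

There are more methods which can be overridden, but in general you are less likely to need them.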
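As a rough sketch of the "separate members" option: the bookkeeping can live in a small plain class that the overridden methods delegate to. LineCollector, Line, lineBreak and text are hypothetical names made up for this illustration and are not part of PDFBox. The sketch assumes you call lineBreak() from your overrides of startPage and writeLineSeparator, and call text(text, textPositions.get(0).getXDirAdj()) from your override of writeString.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of PDFBox): groups text chunks into lines and
 *  remembers the x coordinate reported for the first chunk of each line. */
final class LineCollector {

    /** One collected line: x of its first chunk plus the chunks in arrival order. */
    static final class Line {
        final float startX;
        final List<String> chunks = new ArrayList<>();

        Line(float startX) {
            this.startX = startX;
        }
    }

    private final List<Line> lines = new ArrayList<>();
    private boolean startOfLine = true;

    /** Assumed to be called whenever a page or a line starts. */
    void lineBreak() {
        startOfLine = true;
    }

    /** Assumed to be called once per text chunk, with the x of that chunk's first character. */
    void text(String chunk, float firstX) {
        if (startOfLine) {
            // First chunk after a break opens a new line record.
            lines.add(new Line(firstX));
            startOfLine = false;
        }
        lines.get(lines.size() - 1).chunks.add(chunk);
    }

    List<Line> getLines() {
        return lines;
    }
}
```

The chunks are kept as a list rather than joined into one String because this sketch makes no assumption about how separators between chunks reach the stripper.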

In the case at hand

I'm using PDFBox to extract information from a pdf, and the information I'm currently trying to find is related to the x-position of the first character in the line.

This can be achieved as follows (this simply adds the information at the start of each line):

PDFTextStripper stripper = new PDFTextStripper()
{
    @Override
    protected void startPage(PDPage page) throws IOException
    {
        startOfLine = true;
        super.startPage(page);
    }

    @Override
    protected void writeLineSeparator() throws IOException
    {
        startOfLine = true;
        super.writeLineSeparator();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        if (startOfLine)
        {
            TextPosition firstPosition = textPositions.get(0);
            writeString(String.format("[%s]", firstPosition.getXDirAdj()));
            startOfLine = false;
        }
        super.writeString(text, textPositions);
    }
    boolean startOfLine = true;
};

text = stripper.getText(document);

(ExtractText.java method extractLineStart tested by testExtractLineStartFromSampleFile)
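If you need the x-position as a number again later, the bracketed prefix written by the code above can be split off each line of the extracted text. The following is only a minimal sketch: LineStartParser and ParsedLine are hypothetical names invented for this illustration, and it assumes every non-empty line starts with a prefix of the form "[72.0]".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits a line such as "[72.0]Hello" into x = 72.0 and text = "Hello". */
final class LineStartParser {

    /** Result holder: the x value from the prefix and the remaining text. */
    static final class ParsedLine {
        final float x;
        final String text;

        ParsedLine(float x, String text) {
            this.x = x;
            this.text = text;
        }
    }

    // A "[...]" prefix followed by the rest of the line; DOTALL so a trailing '\r' is tolerated.
    private static final Pattern PREFIXED =
            Pattern.compile("^\\[([^\\]]+)\\](.*)$", Pattern.DOTALL);

    /** Returns the parsed line, or null if there is no prefix or the prefix is not a number. */
    static ParsedLine parse(String line) {
        Matcher m = PREFIXED.matcher(line);
        if (!m.matches()) {
            return null;
        }
        try {
            return new ParsedLine(Float.parseFloat(m.group(1)), m.group(2));
        } catch (NumberFormatException e) {
            // The bracket content was ordinary text, not a coordinate.
            return null;
        }
    }
}
```

You would call LineStartParser.parse on each line of the text returned by getText. Lines without a prefix, such as empty lines, come back as null.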
