Get line of PDF file after a specific line

Question

I use Apache PDFBox to parse text from a PDF file. I am trying to get the line that comes after a specific line.

PDDocument document = PDDocument.load(new File("my.pdf"));
if (!document.isEncrypted()) {
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(document);
    System.out.println("Text from pdf:" + text);
} else{
    log.info("File is encrypted!");
}
document.close();

Sample:

Sentence 1, nth line of file

Needed line

Sentence 3, n+2th line of file

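I tried to get all lines of the file into an array, but that approach is unreliable because I cannot filter down to the specific text. The second solution has the same problem, which is why I am looking for a PDFBox-based solution.

Solution 1: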

String[] lines = myString.split(System.getProperty("line.separator"));

Solution 2:

String neededline = (String) FileUtils.readLines(file).get("n+2th")

Answer 1

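In fact, the source code of the PDFTextStripper class uses exactly the same line ending as you do, so your first attempt is about as close to correct as you can get with PDFBox.

You see, the PDFTextStripper getText method calls the writeText method, which just writes to an output buffer line by line with the writeString method, exactly as you have already tried. The result returned by getText is buffer.toString().

So, given a well-formed PDF, the question you are really asking seems to be how to filter an array for specific text. Here are some ideas:

First, capture the lines in an array, as you said.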

import java.io.File;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class Main {

    static String[] lines;

    public static void main(String[] args) throws Exception {
        PDDocument document = PDDocument.load(new File("my2.pdf"));
        PDFTextStripper stripper = new PDFTextStripper();
        String text = stripper.getText(document);
        lines = text.split(System.getProperty("line.separator"));
        document.close();
    }
}

Here is a simple method that returns the full string at any line index:

// returns a full String line by number n
static String getLine(int n) {
    return lines[n];
}

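Here is a linear search method that looks for a matching string and returns the index of the first line that contains it, or -1 if no line matches.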

// searches all lines for first line index containing `filter`
static int getLineNumberWithFilter(String filter) {
    int n = 0;
    for(String line : lines) {
        if(line.indexOf(filter) != -1) {
            return n;
        }
        n++;
    }
    return -1;
}

With the above, you can print a line by its index:

System.out.println(getLine(8)); // line 8 for example

Or print the whole line that contains the search text:

System.out.println(lines[getLineNumberWithFilter("Cat dog mouse")]);
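Note that the one-liner above throws an ArrayIndexOutOfBoundsException when nothing matches, because getLineNumberWithFilter returns -1. It also returns the matching line itself rather than the line after it, which is what the question asks for. Here is a minimal sketch of that last step, assuming the text has already been split into an array as shown earlier. The class name LineFinder, the method lineAfter, and the decision to skip blank lines are illustrative assumptions, not part of PDFBox:

```java
import java.util.Optional;

final class LineFinder {

    // Returns the first non-blank line that follows the first line containing
    // `marker`, or Optional.empty() if the marker is not found or nothing follows it.
    static Optional<String> lineAfter(String[] lines, String marker) {
        for (int i = 0; i < lines.length; i++) {
            if (lines[i].contains(marker)) {
                // Skip blank lines, which text extraction often leaves between paragraphs.
                for (int j = i + 1; j < lines.length; j++) {
                    if (!lines[j].trim().isEmpty()) {
                        return Optional.of(lines[j]);
                    }
                }
                return Optional.empty(); // only blank lines (or nothing) after the marker
            }
        }
        return Optional.empty(); // marker not found
    }

    public static void main(String[] args) {
        // Sample data modelled on the question; in practice this would be the
        // array produced by splitting the PDFTextStripper output.
        String[] lines = {
            "Sentence 1, nth line of file", "", "Needed line", "", "Sentence 3, n+2th line of file"
        };
        System.out.println(lineAfter(lines, "Sentence 1").orElse("<not found>"));
    }
}
```

Returning an Optional forces the caller to handle the "no match" case explicitly instead of indexing the array with -1.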

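This all looks very simple, and it only works under the assumption that the text can be split into an array of lines by the line separator. If the solution is not as simple as the ideas above, I believe the root of your problem may lie not in how you use PDFBox but in the PDF source you are trying to text-mine.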
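If the split itself is the unstable part, one thing worth ruling out is a mismatch between the line breaks in the extracted text and System.getProperty("line.separator"). Here is a small sketch, assuming Java 8 or later, that splits on any line break using the regex \R. The class and method names are hypothetical:

```java
import java.util.Arrays;

final class LineSplitter {

    // Splits on any line break sequence (\r\n, \n, \r, ...) rather than
    // only on the platform's own line separator.
    static String[] splitLines(String text) {
        return text.split("\\R");
    }

    public static void main(String[] args) {
        // Mixed line endings, as can happen when text moves between platforms.
        String text = "Sentence 1\r\nNeeded line\nSentence 3\n";
        System.out.println(Arrays.toString(splitLines(text)));
    }
}
```

Because String.split drops trailing empty strings, a final line break does not produce an extra empty element at the end of the array.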

Here is a link to a tutorial that also does what you are trying to do:

https://www.tutorialkart.com/pdfbox/extract-text-line-by-line-from-pdf/

Again, the same approach...
