Apache Tika could not extract full text content from a large PDF

Question

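I am trying to extract text from a large PDF file (not a scanned/rasterized PDF) using Apache Tika.

However, when I compared the original text in the PDF with the extracted text, I found that a lot of the content was missing. I tried setMaxStringLength(-1) and BodyContentHandler(-1) to remove the output length limits, but I still could not extract the full text from the PDF.

Below are the two samples I tried.

Sample 1: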

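import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class Extract
{
    public static void main( String[] args ) throws IOException, TikaException
    {
        File file = new File("1.pdf");

        // Instantiate the Tika facade class
        Tika tika = new Tika();
        // -1 disables the limit on the length of the extracted string
        tika.setMaxStringLength(-1);
        String filecontent = tika.parseToString(file);
        System.out.println("Extracted Content: " + filecontent);
    }
}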

Sample 2:

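import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class Extract
{
    public static void main( String[] args ) throws IOException, SAXException, TikaException
    {
        // -1 disables the write limit so any number of characters can be collected
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        ParseContext pcontext = new ParseContext();

        // Parse the document with the PDF parser; the stream is closed automatically
        try (FileInputStream inputstream = new FileInputStream(new File("1.pdf"))) {
            PDFParser pdfparser = new PDFParser();
            pdfparser.parse(inputstream, handler, metadata, pcontext);
        }

        // Print the content of the document
        System.out.println("Contents of the PDF: " + handler.toString());

        // Print the metadata of the document
        System.out.println("Metadata of the PDF:");
        String[] metadataNames = metadata.names();

        for (String name : metadataNames) {
            System.out.println(name + " : " + metadata.get(name));
        }
    }
}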

I can see the content of the last page of the PDF in the output, but a lot of text from other parts of the PDF is missing at random.

Answer 1

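This was the silliest mistake I have made. I was reading the output from the Eclipse console, which has a limited buffer size. When I wrote the output to a file instead, the extracted text was complete.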
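For anyone hitting the same symptom, here is a minimal sketch of writing the extracted text to a file instead of printing it to the console. It uses only the standard library. The class name SaveExtractedText and the helper saveText are hypothetical names chosen for this illustration, and the generated string is only a stand-in for what tika.parseToString(file) or handler.toString() would return in the samples above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SaveExtractedText {

    // Hypothetical helper: writes the whole string to a UTF-8 file
    // and returns the number of characters that were written.
    public static int saveText(String text, Path target) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
        return text.length();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the extracted text; in the question this would be
        // the result of tika.parseToString(file) or handler.toString().
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200_000; i++) {
            sb.append("line ").append(i).append('\n');
        }
        String extracted = sb.toString();

        // Write to a temporary file rather than to the console.
        Path target = Files.createTempFile("extracted", ".txt");
        int written = saveText(extracted, target);

        // Read the file back to confirm that nothing was dropped.
        String readBack = new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
        System.out.println("Characters written: " + written
                + ", characters read back: " + readBack.length());
        System.out.println("Output file: " + target);
    }
}
```

Comparing the character counts from the file, rather than scrolling through the console, makes it easier to tell whether text is really missing from the extraction or was only dropped by the console buffer.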

java

pdf

text-extraction

apache-tika