Extract text from a PDF in Railo

Question

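I've just taken over a Railo site (Railo 3.3.4.003) and I want to index a large number of PDFs. However, cfindex seems to index only text documents. I see there is <cfpdf action="extracttext">, but apparently that isn't supported in Railo. Can anyone confirm this, or say otherwise? If it isn't supported, is org.apache.pdfbox the best option?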

Answer 1

PDFBox will certainly do the job. Railo's class path includes an old version, but I found it problematic. Instead, I would use JavaLoader to load the latest version.

pdfTextExtractor.cfc

/* The latest pre-built standalone PDFBox jar file and the javaloader package are assumed to be in the same folder as the following component */
component{

    function init( javaLoaderPath="javaloader.JavaLoader" ){
        // Cache the JavaLoader instance in the server scope so the jar is only loaded once
        if( !server.KeyExists( "_pdfBoxLoader" ) ){
            var paths=[];
            paths.append( GetDirectoryFromPath( GetCurrentTemplatePath() ) & "pdfbox-app-1.8.11.jar" );
            server._pdfBoxLoader=New "#javaLoaderPath#"( paths );
        }
        variables.reader=server._pdfBoxLoader.create( "org.apache.pdfbox.pdmodel.PDDocument" );
        variables.stripper=server._pdfBoxLoader.create( "org.apache.pdfbox.util.PDFTextStripper" );
        return this;
    }

    string function extractText( required string pdfPath, numeric startPage=0, numeric endPage=0 ){
        // A value of 0 leaves the stripper's current page range unchanged
        if( Val( startPage ) )
            stripper.setStartPage( startPage );
        if( Val( endPage ) )
            stripper.setEndPage( endPage );
        // PDDocument.load() is static and returns the document instance to read
        var pdf=reader.load( pdfPath );
        try{
            return stripper.getText( pdf );
        }
        finally{
            // Close the loaded document itself, not the PDDocument class reference
            pdf.close();
        }
    }

}
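
As a rough sketch of how the component above might be used for the indexing asked about in the question, the extracted text could be passed to cfindex as a custom entry. The collection name "pdfDocs" and the file path are hypothetical, the collection is assumed to exist already, and type="custom" with a plain-text body is assumed to behave in Railo as it does in ColdFusion, so treat this as a starting point rather than tested code.

```cfml
<!--- Assumes pdfTextExtractor.cfc is in the same folder as this template --->
<cfset extractor = new pdfTextExtractor()>

<!--- Hypothetical path to one of the PDFs to be indexed --->
<cfset pdfPath = ExpandPath( "docs/sample.pdf" )>
<cfset text = extractor.extractText( pdfPath=pdfPath )>

<!--- "pdfDocs" is a placeholder collection name; create it first with cfcollection --->
<cfindex action="update"
    collection="pdfDocs"
    type="custom"
    key="#pdfPath#"
    title="#GetFileFromPath( pdfPath )#"
    body="#text#">
```

Using the file path as the key means that re-indexing the same PDF should update its existing entry rather than add a duplicate.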

For more details, see http://blog.simplicityweb.co.uk/94/migrating-from-coldfusion-to-railo-part-7-pdfs.

The above also works in Lucee, Railo's successor, which I strongly recommend migrating to.

Tags: pdf, lucene, indexing, railo, cfml