Extract text from multiple PDFs using Java
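I have more than 1000 PDF files and need to extract the text from each one and write it to a .txt file. I have code that handles a single PDF, but I can't get it to work for multiple PDFs. My code is below:
Main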
package pdftest;

import java.io.File;
import java.io.IOException;

public class JavaPDFTest {
    public static void main(String[] args) throws IOException {
        // Backslashes must be escaped inside Java string literals.
        String path = "C:\\Users\\arunk01\\Desktop\\Java_Extraction\\";
        String files;
        File folder = new File(path);
        File[] listOfFiles = folder.listFiles();
        for (int i = 0; i < listOfFiles.length; i++)
        {
            if (listOfFiles[i].isFile())
            {
                files = listOfFiles[i].getName();
                if (files.endsWith(".pdf") || files.endsWith(".PDF"))
                {
                    System.out.println(files);
                    String nfiles = "C:\\Users\\arunk01\\Desktop\\Java_Extraction\\";
                    PDFManager pdfManager = new PDFManager();
                    String pdfToText = pdfManager.pdftoText(nfiles+files);
                    if (pdfToText == null) {
                        System.out.println("PDF to Text Conversion failed.");
                    }
                    else {
                        System.out.println("\nThe text parsed from the PDF Document....\n" + pdfToText);
                        pdfManager.writeTexttoFile(pdfToText,nfiles+files+".txt");
                    }
                }
            }
        }
    }
}
Class
package pdftest;

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFManager {
    private PDFParser parser;
    private PDFTextStripper pdfStripper;
    private PDDocument pdDoc;
    private COSDocument cosDoc;
    private String pdftoText;
    private String Text;
    private String filePath;
    private File file;

    public PDFManager() {
    }

    public String ToText() throws IOException
    {
        this.pdfStripper = null;
        this.pdDoc = null;
        this.cosDoc = null;
        file = new File(filePath);
        parser = new PDFParser(new RandomAccessFile(file,"r")); // update for PDFBox V 2.0
        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        pdDoc.getNumberOfPages();
        pdfStripper.setStartPage(1);
        // pdfStripper.setEndPage(10);
        // reading text from page 1 to 10
        // if you want to get text from full pdf file use this code
        pdfStripper.setEndPage(pdDoc.getNumberOfPages());
        Text = pdfStripper.getText(pdDoc);
        return Text;
    }

    public void setFilePath(String filePath) {
        this.filePath = filePath;
    }

    public String pdftoText(String string) {
        // TODO Auto-generated method stub
        return Text;
    }

    public void writeTexttoFile(String pdfToText2, String string) {
        // TODO Auto-generated method stub
    }
}
I don't get any errors, but the program prints "PDF to Text Conversion failed." for every file (the if condition in main is hit):
2016__00002685__00.PDF
PDF to Text Conversion failed.
2016__00002685__01.PDF
PDF to Text Conversion failed.
2016__100018__00.PDF
PDF to Text Conversion failed.
2016__100018__01.PDF
PDF to Text Conversion failed.
Can anyone help me with code that converts multiple PDFs to text?
Thanks,
Arun
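The pdftoText method in the PDFManager class returns the Text field, which is still null because nothing has parsed the PDF yet. You need to call the ToText method. Try this: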
public String pdftoText(String filePath) throws IOException {
    this.setFilePath(filePath);
    return ToText();
}
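Note that writeTexttoFile in the question is also an empty stub, so even with the fix above no .txt files would be written. A minimal sketch of that step follows, using only the standard library. The class name TextFileWriter is hypothetical (in the question this logic would sit inside PDFManager.writeTexttoFile), and UTF-8 output is an assumption.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; in the question this body would go into
// PDFManager.writeTexttoFile(String, String).
final class TextFileWriter {
    /** Writes text to outputPath as UTF-8, creating or overwriting the file. */
    static void writeTextToFile(String text, String outputPath) throws IOException {
        Path target = Paths.get(outputPath);
        // Files.write creates the file if needed and truncates an existing one.
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }
}
```

Because this method declares IOException, writeTexttoFile in PDFManager would need the same throws clause; main in the question already declares it.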
In addition to @Unknown's answer, the following might help simplify PDFManager. It would probably be better to have only one method in PDFManager, either pdfToText() or ToText():
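public String ToText() throws IOException {
    // PDDocument.load(File) is the PDFBox 2.x loader; try-with-resources closes the document.
    try (PDDocument pdDoc = PDDocument.load(new File(filePath))) {
        // startPage=1 and endPage=Integer.MAX_VALUE by default.
        return new PDFTextStripper().getText(pdDoc);
    }
}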
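With 1000+ files, it may also help to keep one bad PDF from aborting the whole run, since main in the question simply propagates IOException. The sketch below shows one way to structure the loop. BatchConverter and TextExtractor are hypothetical names, the extractor callback stands in for whatever PDF-to-text call you use, and writing report.txt rather than report.PDF.txt is a choice made here, not something the question requires.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Locale;

// Hypothetical batch driver; not part of PDFBox.
final class BatchConverter {
    /** Stand-in for a PDF-to-text call such as PDFManager.pdftoText. */
    interface TextExtractor {
        String extract(File pdf) throws IOException;
    }

    /** True for regular files whose name ends in .pdf, ignoring case. */
    static boolean isPdf(File f) {
        return f.isFile() && f.getName().toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    /** Maps "report.PDF" to "report.txt"; expects a name ending in .pdf (any case). */
    static String txtNameFor(String pdfName) {
        return pdfName.substring(0, pdfName.length() - ".pdf".length()) + ".txt";
    }

    /** Converts every PDF in folder and returns how many were written successfully. */
    static int convertAll(File folder, TextExtractor extractor) {
        File[] files = folder.listFiles();
        if (files == null) {
            return 0; // folder is not a directory, or it could not be read
        }
        int converted = 0;
        for (File f : files) {
            if (!isPdf(f)) {
                continue;
            }
            try {
                String text = extractor.extract(f);
                if (text == null) {
                    throw new IOException("no text extracted");
                }
                File out = new File(folder, txtNameFor(f.getName()));
                Files.write(out.toPath(), text.getBytes(StandardCharsets.UTF_8));
                converted++;
            } catch (IOException e) {
                // Report the failure and continue with the remaining files.
                System.err.println("Failed: " + f.getName() + ": " + e.getMessage());
            }
        }
        return converted;
    }
}
```

Assuming the corrected pdftoText(String) from the first answer, a call could look like BatchConverter.convertAll(new File(path), pdf -> new PDFManager().pdftoText(pdf.getPath())).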