java - PDFBox: how to extract text from documents without storing it in an array?
I am using PDFBox to extract text from PDF documents. After extraction, I insert the text into a table in MySQL.
Code:
PDDocument document = PDDocument.load(new File(path1));
if (!document.isEncrypted()) {
    PDFTextStripper tStripper = new PDFTextStripper();
    String pdfFileInText = tStripper.getText(document);
    String lines[] = pdfFileInText.split("\r?\n");
    for (String line : lines) {
        String[] words = line.split(" ");
        String sql="insert IGNORE into test.indextable values (?,?);";
        preparedStatement = con1.prepareStatement(sql);
        int i=0;
        for (String word : words) {
            // check if one or more special characters at end of string then remove OR
            // check special characters in beginning of the string then remove
            // insert every word directly to table db
            word=word.replaceAll("([\\W]+$)|(^[\\W]+)", "");
            preparedStatement.setString(1, path1);
            preparedStatement.setString(2, word);
            /* preparedStatement.executeUpdate();
            System.out.print("Add ");*/
            preparedStatement.addBatch();
            i++;
            if (i % 1000 == 0) {
                preparedStatement.executeBatch();
                System.out.print("Add Thousand");
            }
        }
        if (i > 0) {
            preparedStatement.executeBatch();
            System.out.print("Add Remaining");
        }
    }
}
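The code works fine, but as you can see, if the document is large and contains around 10 million words, lines[] will not cope and an out of memory exception will be thrown.
I can't think of a solution.
Is there any way to extract the words and insert them directly into the database, or is that not possible?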
Edited:
This is what I did:
The processText method:
public void processText(String text) throws SQLException {
    String lines[] = text.split("\r?\n");
    for (String line : lines) {
        String[] words = line.split(" ");
        String sql="insert IGNORE into test.indextable values (?,?);";
        preparedStatement = con1.prepareStatement(sql);
        int i=0;
        for (String word : words) {
            // check if one or more special characters at end of string then remove OR
            // check special characters in beginning of the string then remove
            // insert every word directly to table db
            word=word.replaceAll("([\\W]+$)|(^[\\W]+)", "");
            preparedStatement.setString(1, path1);
            preparedStatement.setString(2, word);
            preparedStatement.addBatch();
            i++;
            if (i % 1000 == 0) {
                preparedStatement.executeBatch();
                System.out.print("Add Thousand");
            }
        }
        if (i > 0) {
            preparedStatement.executeBatch();
            System.out.print("Add Remaining");
        }
    }
    preparedStatement.close();
    System.out.println("Successfully committed changes to the database!");
}
The index method (which calls the method above):
public void index() throws Exception {
    // Connection con1 = con.connect();
    try {
        // Connection con1=con.connect();
        // Connection con1 = con.connect();
        Statement statement = con1.createStatement();
        ResultSet rs = statement.executeQuery("select * from filequeue where Status='Active' LIMIT 5");
        while (rs.next()) {
            // get the filepath of the PDF document
            path1 = rs.getString(2);
            int getNum = rs.getInt(1);
            // while running the process, update status : Processing
            //updateProcess_DB(getNum);
            Statement test = con1.createStatement();
            test.executeUpdate("update filequeue SET STATUS ='Processing' where UniqueID="+getNum);
            try {
                // call the index function
                /*Indexing process = new Indexing();
                process.index(path1);*/
                PDDocument document = PDDocument.load(new File(path1));
                if (!document.isEncrypted()) {
                    PDFTextStripper tStripper = new PDFTextStripper();
                    for(int p=1; p<=document.getNumberOfPages();++p) {
                        tStripper.setStartPage(p);
                        tStripper.setEndPage(p);
                        String pdfFileInText = tStripper.getText(document);
                        processText(pdfFileInText);
                    }
                }
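Your current code takes the string pdfFileInText returned by tStripper.getText(document); and so loads the text of the whole document at once. First, refactor everything you do with that string (starting with pdfFileInText.split) into a separate method, e.g. processText. Then change your code to this: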
PDFTextStripper tStripper = new PDFTextStripper();
for (int p = 1; p <= document.getNumberOfPages(); ++p)
{
    tStripper.setStartPage(p); // 1-based
    tStripper.setEndPage(p); // 1-based
    String pdfFileInText = tStripper.getText(document);
    processText(pdfFileInText);
}
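The new code processes each page separately. This way you can perform the database inserts in smaller steps, and you only have to hold the words of one page in memory rather than the words of the whole document.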
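To make the per-page idea more concrete, here is a minimal sketch of what processText could hand its words to. It needs neither PDFBox nor a database, so it can be run on its own. The class name PageWordBatcher and the sink callback are hypothetical and stand in for the JDBC batch insert. It also makes two assumptions that differ from the question's code: words are split on any whitespace using a Scanner (so no lines[] or words[] array is built), and tokens that consist only of punctuation are skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.function.Consumer;

/** Collects cleaned words page by page and hands them off in fixed-size batches. */
final class PageWordBatcher {
    private final int batchSize;
    // Hypothetical stand-in for the database insert; receives one batch at a time.
    private final Consumer<List<String>> sink;
    private final List<String> pending = new ArrayList<>();

    PageWordBatcher(int batchSize, Consumer<List<String>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Tokenizes the text of one page lazily and queues the cleaned words. */
    void addPage(String pageText) {
        // A Scanner splits on whitespace by default and yields one token at a time.
        try (Scanner scanner = new Scanner(pageText)) {
            while (scanner.hasNext()) {
                // Strip non-word characters from both ends, as the question's regex does.
                String word = scanner.next().replaceAll("(\\W+$)|(^\\W+)", "");
                if (word.isEmpty()) {
                    continue; // assumption: punctuation-only tokens are not indexed
                }
                pending.add(word);
                if (pending.size() == batchSize) {
                    flush();
                }
            }
        }
    }

    /** Hands any queued words to the sink; call once after the last page. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending)); // copy, so the sink may keep the list
        pending.clear();
    }

    public static void main(String[] args) {
        PageWordBatcher batcher =
                new PageWordBatcher(3, batch -> System.out.println("batch: " + batch));
        batcher.addPage("First page, with (some) words!");
        batcher.addPage("Second page...");
        batcher.flush();
    }
}
```

In the real program, addPage would be called with the result of tStripper.getText(document) for each page, and the sink would call setString and addBatch for every word on a single PreparedStatement and then executeBatch. At most one page of text plus one batch of words is then held in memory at a time.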