使用 pdfbox 替换 pdf 中的文本时出现错误字符
Bad characters when replacing text in pdf using pdfbox
我正在尝试替换 pdf 中的文本,它有点被替换了,这是我的代码
PDDocument doc = null;
int occurrences = 0;
try {
doc = PDDocument.load("test.pdf"); //Input PDF File Name
List pages = doc.getDocumentCatalog().getAllPages();
for (int i = 0; i < pages.size(); i++) {
PDPage page = (PDPage) pages.get(i);
PDStream contents = page.getContents();
PDFStreamParser parser = new PDFStreamParser(contents.getStream());
parser.parse();
List tokens = parser.getTokens();
for (int j = 0; j < tokens.size(); j++) {
Object next = tokens.get(j);
if (next instanceof PDFOperator) {
PDFOperator op = (PDFOperator) next;
// Tj and TJ are the two operators that display strings in a PDF
if (op.getOperation().equals("Tj")) {
// Tj takes one operator and that is the string
// to display so lets update that operator
COSString previous = (COSString) tokens.get(j - 1);
String string = previous.getString();
if (string.contains("Good")) {
string = string.replace("Good", "Bad");
occurrences++;
}
//Word you want to change. Currently this code changes word "Good" to "Bad"
previous.reset();
previous.append(string.getBytes("ISO-8859-1"));
} else if (op.getOperation().equals("TJ")) {
COSArray previous = (COSArray) tokens.get(j - 1);
COSString temp = new COSString();
String tempString = "";
for (int t = 0; t < previous.size(); t++) {
if (previous.get(t) instanceof COSString) {
tempString += ((COSString) previous.get(t)).getString();
}
}
temp.append(tempString.getBytes("ISO-8859-1"));
tempString = "";
tempString = temp.getString();
if (tempString.contains("Good")) {
tempString = tempString.replace("Good", "Bad");
occurrences++;
}
previous.clear();
String[] stringArray = tempString.split(" ");
for (String string : stringArray) {
COSString cosString = new COSString();
string = string + " ";
cosString.append(string.getBytes("ISO-8859-1"));
previous.add(cosString);
}
}
}
}
// now that the tokens are updated we will replace the page content stream.
PDStream updatedStream = new PDStream(doc);
OutputStream out = updatedStream.createOutputStream();
ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
tokenWriter.writeTokens(tokens);
page.setContents(updatedStream);
}
System.out.println("number of matches found: " + occurrences);
doc.save("a.pdf"); //Output file name
} catch (IOException ex) {
Logger.getLogger(ReplaceTextInPDF.class.getName()).log(Level.SEVERE, null, ex);
} catch (COSVisitorException ex) {
Logger.getLogger(ReplaceTextInPDF.class.getName()).log(Level.SEVERE, null, ex);
} finally {
if (doc != null) {
try {
doc.close();
} catch (IOException ex) {
Logger.getLogger(ReplaceTextInPDF.class.getName()).log(Level.SEVERE, null, ex);
}
}
}
它被替换为坏字符或隐藏形状的问题(例如,坏词变成了只有 d 字符),但如果我将它复制并粘贴到另一个地方,它会正确粘贴预期的词,
同样,当我在生成的 pdf 中搜索新词时,它没有找到它,但是当我用旧词搜索时,它会在替换的地方找到它
我找到了 aspose,这个 link 展示了如何使用它来替换 pdf 中的文本,它很简单并且工作完美,只是它不是免费的,所以免费版本是在 pdf 的头部打印版权行文件页
http://www.aspose.com/docs/display/pdfjava/Replace+Text+in+Pages+of+a+PDF+Document
我正在尝试替换 pdf 中的文本,它有点被替换了,这是我的代码
PDDocument doc = null;
int occurrences = 0;
try {
doc = PDDocument.load("test.pdf"); //Input PDF File Name
List pages = doc.getDocumentCatalog().getAllPages();
for (int i = 0; i < pages.size(); i++) {
PDPage page = (PDPage) pages.get(i);
PDStream contents = page.getContents();
PDFStreamParser parser = new PDFStreamParser(contents.getStream());
parser.parse();
List tokens = parser.getTokens();
for (int j = 0; j < tokens.size(); j++) {
Object next = tokens.get(j);
if (next instanceof PDFOperator) {
PDFOperator op = (PDFOperator) next;
// Tj and TJ are the two operators that display strings in a PDF
if (op.getOperation().equals("Tj")) {
// Tj takes one operator and that is the string
// to display so lets update that operator
COSString previous = (COSString) tokens.get(j - 1);
String string = previous.getString();
if (string.contains("Good")) {
string = string.replace("Good", "Bad");
occurrences++;
}
//Word you want to change. Currently this code changes word "Good" to "Bad"
previous.reset();
previous.append(string.getBytes("ISO-8859-1"));
} else if (op.getOperation().equals("TJ")) {
COSArray previous = (COSArray) tokens.get(j - 1);
COSString temp = new COSString();
String tempString = "";
for (int t = 0; t < previous.size(); t++) {
if (previous.get(t) instanceof COSString) {
tempString += ((COSString) previous.get(t)).getString();
}
}
temp.append(tempString.getBytes("ISO-8859-1"));
tempString = "";
tempString = temp.getString();
if (tempString.contains("Good")) {
tempString = tempString.replace("Good", "Bad");
occurrences++;
}
previous.clear();
String[] stringArray = tempString.split(" ");
for (String string : stringArray) {
COSString cosString = new COSString();
string = string + " ";
cosString.append(string.getBytes("ISO-8859-1"));
previous.add(cosString);
}
}
}
}
// now that the tokens are updated we will replace the page content stream.
PDStream updatedStream = new PDStream(doc);
OutputStream out = updatedStream.createOutputStream();
ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
tokenWriter.writeTokens(tokens);
page.setContents(updatedStream);
}
System.out.println("number of matches found: " + occurrences);
doc.save("a.pdf"); //Output file name
} catch (IOException ex) {
Logger.getLogger(ReplaceTextInPDF.class.getName()).log(Level.SEVERE, null, ex);
} catch (COSVisitorException ex) {
Logger.getLogger(ReplaceTextInPDF.class.getName()).log(Level.SEVERE, null, ex);
} finally {
if (doc != null) {
try {
doc.close();
} catch (IOException ex) {
Logger.getLogger(ReplaceTextInPDF.class.getName()).log(Level.SEVERE, null, ex);
}
}
}
它被替换为坏字符或隐藏形状的问题(例如,坏词变成了只有 d 字符),但如果我将它复制并粘贴到另一个地方,它会正确粘贴预期的词, 同样,当我在生成的 pdf 中搜索新词时,它没有找到它,但是当我用旧词搜索时,它会在替换的地方找到它
我找到了 aspose,这个 link 展示了如何使用它来替换 pdf 中的文本,它很简单并且工作完美,只是它不是免费的,所以免费版本是在 pdf 的头部打印版权行文件页 http://www.aspose.com/docs/display/pdfjava/Replace+Text+in+Pages+of+a+PDF+Document