Replace a string with Unicode text in a PDF file using PDFBox?

Question

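I need to read strings from a PDF file and replace them with Unicode text. If the replacement consists of ASCII characters, everything works fine. With Unicode characters, however, the output shows question marks or junk text. The font file (TTF) is not the problem: I am able to write Unicode text to a PDF file using a different class (PDPageContentStream). That class offers no option to replace text, though; it can only add new text.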

Sample Unicode text

Bɐɑɒ

The problem (address field)

https://drive.google.com/file/d/1DbsApTCSfTwwK3txsDGW8sXtDG_u-VJv/view?usp=sharing

I am using PDFBox. Please help me with this.

Here is the code I am using:

    public static PDDocument _ReplaceText(PDDocument document, String searchString, String replacement)
        throws IOException {
    if (StringUtils.isEmpty(searchString) || StringUtils.isEmpty(replacement)) {
        return document;
    }

    for (PDPage page : document.getPages()) {

        PDResources resources = new PDResources();
        PDFont font = PDType0Font.load(document, new File("arial-unicode-ms.ttf"));
        //PDFont font2 = PDType0Font.load(document, new File("avenir-next-regular.ttf"));
        resources.add(font);
        //resources.add(font2);
        //resources.add(PDType1Font.TIMES_ROMAN);
        page.setResources(resources);
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        List tokens = parser.getTokens();

        for (int j = 0; j < tokens.size(); j++) {
            Object next = tokens.get(j);
            if (next instanceof Operator) {
                Operator op = (Operator) next;

                String pstring = "";
                int prej = 0;

                // Tj and TJ are the two operators that display strings in a PDF
                if (op.getName().equals("Tj")) {
                    // Tj takes one operand, the string to display, so let's update that
                    // operand
                    COSString previous = (COSString) tokens.get(j - 1);
                    String string = previous.getString();
                    string = string.replaceFirst(searchString, replacement);
                    previous.setValue(string.getBytes());
                } else if (op.getName().equals("TJ")) {
                    COSArray previous = (COSArray) tokens.get(j - 1);
                    for (int k = 0; k < previous.size(); k++) {
                        Object arrElement = previous.getObject(k);
                        if (arrElement instanceof COSString) {
                            COSString cosString = (COSString) arrElement;
                            String string = cosString.getString();

                            if (j == prej) {
                                pstring += string;
                            } else {
                                prej = j;
                                pstring = string;
                            }
                        }
                    }

                    if (searchString.equals(pstring.trim())) {
                        COSString cosString2 = (COSString) previous.getObject(0);
                        cosString2.setValue(replacement.getBytes());

                        int total = previous.size() - 1;
                        for (int k = total; k > 0; k--) {
                            previous.remove(k);
                        }
                    }
                }
            }
        }

        // now that the tokens are updated we will replace the page content stream.
        PDStream updatedStream = new PDStream(document);
        OutputStream out = updatedStream.createOutputStream(COSName.FLATE_DECODE);
        ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
        tokenWriter.writeTokens(tokens);
        out.close();
        page.setContents(updatedStream);
    }

    return document;
}

Answer 1

Your code completely breaks the PDF, as Adobe Preflight reports.

The reason is obvious: your code

    PDResources resources = new PDResources();
    PDFont font = PDType0Font.load(document, new File("arial-unicode-ms.ttf"));
    resources.add(font);
    page.setResources(resources);

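removes the pre-existing page resources, and your replacement contains only a single font, whose name you let PDFBox choose arbitrarily.

You must not remove the existing resources, because they are used in your document.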
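As a minimal sketch of keeping the existing resources, the new font can be added to whatever resources the page already has. The class name `FontResources` and the method `addFont` are hypothetical names chosen for illustration; only `PDPage.getResources`, `PDPage.setResources`, and `PDResources.add` are PDFBox calls.

```java
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDFont;

/** Hypothetical helper, not part of PDFBox. */
class FontResources {
    /**
     * Adds the font to the page's existing resources instead of replacing them
     * and returns the resource name PDFBox assigned to it.
     */
    static COSName addFont(PDPage page, PDFont font) {
        PDResources resources = page.getResources();
        if (resources == null) {
            // Only create a fresh dictionary when the page has none at all.
            resources = new PDResources();
            page.setResources(resources);
        }
        return resources.add(font);
    }
}
```

With this in place, the font would be loaded once per document (for example `PDType0Font.load(document, new File("arial-unicode-ms.ttf"))`) rather than once per page, and the returned name is the one a `Tf` operator in the content stream would have to reference.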

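Inspecting the content of your PDF page, it is clear that the encodings of the originally used fonts T1_0 and T1_1 are either single-byte encodings or mixed single/multi-byte encodings; the lower single-byte values appear to be encoded ASCII-like.

I assume the encoding is WinAnsiEncoding or a subset thereof. As a corollary, your task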

to read the strings from PDF file and replace it with the Unicode text

cannot be implemented as a simple replacement, at least not for arbitrary Unicode code points.

What you can implement instead is the following:
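To illustrate why the question marks appear, here is a small sketch that uses only the Java standard library. It treats the `windows-1252` charset as a stand-in for WinAnsiEncoding, which is an approximation (the two are close but not identical), and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;

/** Hypothetical demo class; windows-1252 approximates WinAnsiEncoding here. */
class SingleByteLimit {
    static final Charset WIN_ANSI_LIKE = Charset.forName("windows-1252");

    /** True if every character of the text has a code in the single-byte charset. */
    static boolean fitsSingleByteFont(String text) {
        return WIN_ANSI_LIKE.newEncoder().canEncode(text);
    }

    /** Encodes and decodes again; unmappable characters come back as '?'. */
    static String roundTrip(String text) {
        return new String(text.getBytes(WIN_ANSI_LIKE), WIN_ANSI_LIKE);
    }
}
```

For the sample text, `SingleByteLimit.fitsSingleByteFont("B\u0250\u0251\u0252")` is false and `SingleByteLimit.roundTrip("B\u0250\u0251\u0252")` yields `"B???"`: only the `B` has a code the existing font could show. The `string.getBytes()` calls in the question go further astray, since they use the platform default charset, whose bytes need not match the font's encoding at all.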

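First, run your source PDF through a customized text stripper that, instead of extracting plain text, searches for the strings to replace and returns their positions. Many questions and answers here show how to determine the coordinates of strings in a text stripper subclass.
Next, remove those original strings from your PDF. In your case, an approach similar to your original code above (without removing the resources, obviously) that replaces each string with an equally long string of spaces may work, even though it is a dirty hack.
Finally, add your replacements at the determined positions using a PDPageContentStream in append mode; for this, add your new font to the existing resources.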
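The first and last of these steps can be sketched roughly as follows. This is a sketch under several assumptions, not a definitive implementation: `StringLocator`, `Hit`, `locate`, and `stamp` are hypothetical names; a match is only found when the whole search string arrives in a single `writeString` call; the character index is assumed to line up with the `TextPosition` list; and the y conversion assumes an unrotated page whose crop box starts at (0, 0). Blanking out the original string (the second step) is not shown.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

/** Hypothetical text stripper subclass that records where a search string starts. */
class StringLocator extends PDFTextStripper {
    /** One match; x and y are in the stripper's top-left-origin coordinates. */
    static final class Hit {
        final float x, y, fontSize;
        Hit(float x, float y, float fontSize) {
            this.x = x;
            this.y = y;
            this.fontSize = fontSize;
        }
    }

    private final String search;
    final List<Hit> hits = new ArrayList<>();

    StringLocator(String search) throws IOException {
        if (search == null || search.isEmpty()) {
            throw new IllegalArgumentException("search string must not be empty");
        }
        this.search = search;
        setSortByPosition(true);
    }

    @Override
    protected void writeString(String text, List<TextPosition> positions) throws IOException {
        int from = 0;
        int idx;
        while ((idx = text.indexOf(search, from)) >= 0) {
            // Assumption: one TextPosition per character of the text.
            if (idx < positions.size()) {
                TextPosition first = positions.get(idx);
                hits.add(new Hit(first.getXDirAdj(), first.getYDirAdj(), first.getFontSizeInPt()));
            }
            from = idx + search.length();
        }
        super.writeString(text, positions);
    }

    /** Runs the stripper over a single page (1-based) and returns the matches. */
    static List<Hit> locate(PDDocument doc, int pageNumber, String search) throws IOException {
        StringLocator locator = new StringLocator(search);
        locator.setStartPage(pageNumber);
        locator.setEndPage(pageNumber);
        locator.getText(doc); // extraction triggers the writeString callbacks
        return locator.hits;
    }

    /** Appends the replacement text at the position of a match. */
    static void stamp(PDDocument doc, PDPage page, PDFont font, Hit hit, String replacement)
            throws IOException {
        // Convert from top-left origin back to PDF's bottom-left origin.
        float y = page.getCropBox().getHeight() - hit.y;
        try (PDPageContentStream cs = new PDPageContentStream(
                doc, page, PDPageContentStream.AppendMode.APPEND, true, true)) {
            cs.beginText();
            cs.setFont(font, hit.fontSize); // also registers the font in the page resources
            cs.newLineAtOffset(hit.x, y);
            cs.showText(replacement);
            cs.endText();
        }
    }
}
```

A caller would load the Unicode font once with `PDType0Font.load(document, new File("arial-unicode-ms.ttf"))`, call `StringLocator.locate(document, 1, searchString)`, and pass each `Hit` to `stamp`. Because the content stream is opened in append mode, the page's existing resources stay intact and the new font is simply added to them.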

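Please be aware, though, that PDF is not designed for this kind of use. Template PDFs can serve as a background for new content, but attempting to replace content usually is a bad design that leads to trouble. If you need to mark positions in the template, use annotations, which can easily be dropped during fill-in. Or use the native PDF form technology, AcroForm forms.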
