Why are squares shown instead of symbols in the output file when using pdfbox?

Question

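I have spent a week looking for a solution and am still stuck, so maybe someone here knows the answer. I am trying to use pdfbox to replace a marker (for example @test) with the digits 123456 in a .pdf file.

The marker does get replaced, but in the output, instead of the digits, I see squares, question marks inside squares, or digits drawn on top of each other. All I have figured out is that the result depends on the chosen font. I have no idea where the mistake is.

Note: we assumed this was an issue with the port, so we built a Java version against v 2.0 to test, and ran into the same problem.

Has anyone faced a similar problem and found a solution?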

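Technical details:

Version: PDFBox.NET-1.8.9, taken from http://www.squarepdf.net/pdfbox-in-net
Language: C#
.NET Framework 4.5.2
Fonts used: Times New Roman, Tahoma, Courier, Calibri.

How the MS Word document was created:

Right-click on the desktop
Select Microsoft Word Document from the New menu
Type the text inside: @test

Code: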

private void ReplaceTextInPdf(string inputPath, string outputPath) {
            PDDocument doc = null;
            try {
                File input = new File(inputPath);
                doc = PDDocument.loadNonSeq(input, null);
                List pages = doc.getDocumentCatalog().getAllPages();

                for (int i = 0; i < pages.size(); i++) {
                    PDPage page = (PDPage)pages.get(i);
                    PDStream contents = page.getContents();
                    PDFStreamParser parser = new PDFStreamParser(contents.getStream());
                    parser.parse();
                    List tokens = parser.getTokens();

                    for (int j = 0; j < tokens.size(); j++) {
                        Object next = tokens.get(j);
                        if (next is PDFOperator) {
                            PDFOperator op = (PDFOperator)next;
                            //Tj and TJ are the two operators that display
                            //strings in a PDF
                            if (op.getOperation() == "Tj") {
                                //Tj takes one operand, the string to display,
                                //so update that operand
                                COSString previous = (COSString)tokens.get(j - 1);
                                String tempString = previous.getString();

                                tempString = tempString.replace("@test", "123456");

                                previous.reset();
                                previous.append(tempString.getBytes("ISO-8859-1"));
                            } else if (op.getOperation() == "TJ") {
                                String tempString = "";
                                COSString cosString = null;
                                COSArray previous = (COSArray)tokens.get(j - 1);
                                for (int k = 0; k < previous.size(); k++) {
                                    Object arrElement = previous.getObject(k);
                                    if (arrElement is COSString) {
                                        cosString = (COSString)arrElement;
                                        tempString += cosString.getString();
                                        cosString.reset();
                                    }
                                }

                                if (tempString != null && tempString.trim().length() > 0) {

                                    tempString = tempString.replace("@test", "123456");

                                    for (int k = 0; k < previous.size(); k++) {
                                        Object arrElement = previous.getObject(k);
                                        if (arrElement is COSString) {
                                            cosString.reset();
                                            cosString.append(tempString.getBytes("ISO-8859-1"));
                                            break;
                                        }
                                    }
                                }
                            }
                        }
                    }

                    //now that the tokens are updated we will replace the
                    //page content stream.
                    PDStream updatedStream = new PDStream(doc);
                    OutputStream out1 = updatedStream.createOutputStream();
                    ContentStreamWriter tokenWriter = new ContentStreamWriter(out1);
                    tokenWriter.writeTokens(tokens);
                    page.setContents(updatedStream);
                }

                doc.save(outputPath);
            } finally {
                if (doc != null) {
                    doc.close();
                }
            }
        }

Answer 1

In general

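First of all, the code you are using only works under benign circumstances, i.e. only for PDFs that were generated in a particular way. PDFs were often created like that in earlier years, but nowadays they mostly are not. This is why the PDFBox example your code is derived from was removed from the source code base for PDFBox 2.0.

The matching entry in the migration guide says: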

Why was the ReplaceText example removed?

The ReplaceText example has been removed as it gave the incorrect illusion that text can be replaced easily. Words are often split, as seen by this excerpt of a content stream:
[ (Do) -29 (c) -1 (umen) 30 (tation) ] TJ
Other problems will appear with font subsets: for example, if only the glyphs for a, b and c are used, these would be encoded as hex 0, 1 and 2, so you won’t find “abc”. Additionally, you can’t replace “c” with “d” because it isn’t part of the subset.

You could also have problems with ligatures, e.g. “ff”, “fl”, “fi”, “ffi”, “ffl”, which can be represented by a single code in many fonts. To understand this yourself, view any file with PDFDebugger and have a look at the “Contents” entry of a page.

See also PDFBox 2.0 RC3 -- Find and replace text

(Migration to PDFBox 2.0.0)

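By concatenating the string fragments of each TJ operator's argument, your code avoids most of the problems caused by kerning. The remaining problems, however, persist.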
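
To make the splitting from the migration guide excerpt concrete, here is a minimal, self-contained Java sketch that does not use PDFBox at all. The class `TjFragments` and its regex-based `fragments` helper are made up for this illustration and only handle simple literal strings without escapes or nested parentheses; for real content streams you would rely on the PDFBox parser as in the question's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TjFragments {
    // Collects the literal-string operands "(...)" of a TJ array and
    // ignores the numeric kerning adjustments between them.
    // Simplification: no escaped or nested parentheses, no hex strings.
    static List<String> fragments(String tjArray) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile("\\(([^()]*)\\)").matcher(tjArray);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // The excerpt quoted from the migration guide.
        String tj = "[ (Do) -29 (c) -1 (umen) 30 (tation) ] TJ";
        List<String> parts = fragments(tj);

        // No single fragment contains the whole word ...
        boolean inAnyFragment =
                parts.stream().anyMatch(p -> p.contains("Documentation"));
        // ... but the concatenation of all fragments does.
        String joined = String.join("", parts);

        System.out.println(parts);
        System.out.println(inAnyFragment);
        System.out.println(joined.contains("Documentation"));
    }
}
```

Searching fragment by fragment misses the word, while searching the concatenation finds it. That is the part your TJ branch already gets right; it does nothing about the encoding and width issues discussed next.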

In case of your example document

In your example document, the problem is that the replacement digits are drawn on top of each other.

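The cause is similar to the "font subsets" problem mentioned in the migration guide. The TTF font program in question is not embedded, so strictly speaking this is not a font subset problem. But the font-related information stored in the PDF is correct only for the glyphs actually used in the original PDF, i.e. '@', 'e', 's', and 't', and not for the replacement glyphs, i.e. the digits '1' to '6'.

The glyph-specific information that matters here is the glyph width: it is given correctly only for the originally used glyphs, while for all other glyphs the stated width is 0! As a result, after a replacement glyph is drawn, the position for the next glyph is not advanced properly but stays where it is (which would be right for a 0-width glyph). The next glyph therefore starts at the same position, and all your replacement glyphs end up drawn on top of each other.

(More specifically, the widths array of that font looks like this: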

[ 250 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 921 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 444 0 0 0 0 0 0 0 0 0 0 0 0 0 389 278]

with '@', 'e', 's', and 't' encoded using WinAnsiEncoding, and the font's character range running from ' ' (space) to 't'.)
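
The effect of those zero widths can be reproduced with a small, self-contained Java sketch. It is only an illustration under stated assumptions: the class `GlyphAdvance` is hypothetical, the first array entry is assumed to belong to the space character (code 32), and font size, character spacing, and kerning are ignored, so positions are simply running sums of the widths.

```java
import java.util.Arrays;

public class GlyphAdvance {
    // Assumption for this sketch: the first width entry is for code 32 (space).
    static final int FIRST_CHAR = 32;

    // Rebuilds a widths array like the one above: non-zero only for the
    // glyphs used in the original PDF, 0 for everything else up to 't'.
    static int[] exampleWidths() {
        int[] w = new int['t' - FIRST_CHAR + 1]; // Java zero-fills the array
        w[' ' - FIRST_CHAR] = 250;
        w['@' - FIRST_CHAR] = 921;
        w['e' - FIRST_CHAR] = 444;
        w['s' - FIRST_CHAR] = 389;
        w['t' - FIRST_CHAR] = 278;
        return w;
    }

    // Returns the horizontal start offset of each glyph, in the same
    // 1/1000 units as the widths array.
    static int[] startOffsets(String text, int[] widths) {
        int[] offsets = new int[text.length()];
        int x = 0;
        for (int i = 0; i < text.length(); i++) {
            offsets[i] = x;
            int index = text.charAt(i) - FIRST_CHAR;
            // Codes outside the array are treated as width 0 here
            // (a simplification made for this sketch).
            int width = (index >= 0 && index < widths.length) ? widths[index] : 0;
            x += width;
        }
        return offsets;
    }

    public static void main(String[] args) {
        int[] widths = exampleWidths();
        System.out.println(Arrays.toString(startOffsets("@test", widths)));
        System.out.println(Arrays.toString(startOffsets("123456", widths)));
    }
}
```

For "@test" the offsets grow with each glyph (0, 921, 1199, 1643, 2032), whereas for "123456" every offset is 0, which matches the overlapping digits you see.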

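In this particular case you can probably work around the problem by adding an invisible string (e.g. white text on a white background) to your Word template that contains every character of the font you might want to use in a placeholder replacement.

In general, though, the encoding does not have to be ASCII-like, as WinAnsiEncoding is. It can be something completely different, even made up ad hoc, e.g. #1 for the first glyph used on the page, #2 for the second distinct glyph on that page, and so on. So in general a workaround is not that easy to find.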
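
As a rough illustration of such an ad hoc encoding, the following self-contained Java sketch assigns codes in order of first use. The class `AdHocEncoding` and both of its methods are hypothetical and do not reflect how any particular PDF producer or PDFBox itself builds encodings.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class AdHocEncoding {
    // Assigns codes 1, 2, 3, ... to characters in order of first use,
    // the way a made-up per-page or per-font encoding might.
    static Map<Character, Integer> buildEncoding(String usedText) {
        Map<Character, Integer> encoding = new LinkedHashMap<>();
        for (char c : usedText.toCharArray()) {
            encoding.putIfAbsent(c, encoding.size() + 1);
        }
        return encoding;
    }

    // Returns the codes for the text, or null if any character has no
    // code in this encoding.
    static int[] encode(String text, Map<Character, Integer> encoding) {
        int[] codes = new int[text.length()];
        for (int i = 0; i < text.length(); i++) {
            Integer code = encoding.get(text.charAt(i));
            if (code == null) {
                return null;
            }
            codes[i] = code;
        }
        return codes;
    }

    public static void main(String[] args) {
        // Only the characters of the original text get codes.
        Map<Character, Integer> encoding = buildEncoding("@test");
        System.out.println(encoding);
        System.out.println(Arrays.toString(encode("@test", encoding)));
        // The digits were never used, so they cannot be encoded at all.
        System.out.println(Arrays.toString(encode("123456", encoding)));
    }
}
```

With such an encoding the content stream would hold the codes 1 2 3 4 2 rather than the ASCII bytes of "@test", so a plain string search finds nothing, and the digits cannot be written because the encoding has no codes for them.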
