Encoding problems after using PDFBox

Question

I have to do the following:

Extract text from a PDF, for which I use roughly this:

f = IOUtility.getFileForPath(filePath);
RandomAccessFile randomAccessFile = new RandomAccessFile(f, "r");
PDFParser parser = new PDFParser(randomAccessFile);
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(pdDoc.getNumberOfPages());
String parsedText = pdfStripper.getText(pdDoc);

Scale the PDF:

File PDFFile = IOUtility.getFileForPath(scaleConfig.getFilePath());
document = PDDocument.load(PDFFile);

for (PDPage page : document.getPages()) {
    PDRectangle cropBox = page.getCropBox();
    float tx = ((cropBox.getLowerLeftX() + cropBox.getUpperRightX()) * 0.03f) / 2;
    float ty = ((cropBox.getLowerLeftY() + cropBox.getUpperRightY()) * 0.03f) / 2;
    PDPageContentStream cs = new PDPageContentStream(document, page, PDPageContentStream.AppendMode.PREPEND, false, false);
    cs.transform(Matrix.getScaleInstance(0.97f, 0.97f));
    cs.transform(Matrix.getTranslateInstance(tx, ty));
    cs.close();
}
document.save(scaleConfig.getTargetFilePath());
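The effect of the two transform calls on the page center can be checked without PDFBox. The following is only a sketch under stated assumptions: the crop box values (0, 0, 595, 842, roughly A4) are made up for illustration, the class and method names are hypothetical, and java.awt.geom.AffineTransform stands in for the content stream's transformation matrix, on the assumption that scaling first and translating second composes the same way as the two successive transform calls.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

public class ScaleOffsetSketch {

    // Maps the crop box center through "scale by 0.97, then translate by (tx, ty)",
    // using the same tx/ty formula as the loop above.
    public static Point2D.Double mapCenter(double llx, double lly, double urx, double ury) {
        double tx = ((llx + urx) * 0.03) / 2;
        double ty = ((lly + ury) * 0.03) / 2;

        AffineTransform at = new AffineTransform();
        at.scale(0.97, 0.97);   // first transform call
        at.translate(tx, ty);   // second transform call, applied in the scaled coordinate system

        Point2D.Double center = new Point2D.Double((llx + urx) / 2, (lly + ury) / 2);
        Point2D.Double mapped = new Point2D.Double();
        at.transform(center, mapped);
        return mapped;
    }

    public static void main(String[] args) {
        // Assumed crop box, not taken from the question.
        Point2D.Double mapped = mapCenter(0, 0, 595, 842);
        System.out.println("original center: 297.5, 421.0");
        System.out.println("mapped center:   " + mapped.x + ", " + mapped.y);
    }
}
```

With these assumed values the center moves by well under one point, so the shrunken page content stays practically centered. The step only changes the coordinate system and does not touch fonts or names.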

And finally write something on every page of the PDF. I use one of the 14 supported fonts mentioned here: https://pdfbox.apache.org/1.8/cookbook/workingwithfonts.html. In this case it is Times New Roman.

File PDFFile = IOUtility.getFileForPath(writeConfig.getFilePath());
document = PDDocument.load(PDFFile);
for (PDPage page : document.getPages()) {
    PDFBoxHelper.fixRotation(document, page);
    writeStringOnPage(document, page, writeConfig);
}
document.save(writeConfig.getTargetFilePath());

with writeStringOnPage being:

contentStream = new PDPageContentStream(document, page, PDPageContentStream.AppendMode.APPEND, false, true);
WriteCoordinates writeCoordinates = WriteCoordinateFactory.buildCoordinates(writeConfig, page.getMediaBox());
contentStream.beginText();
// lower left x and lower left y are different after rotation so use those for your calculation
contentStream.newLineAtOffset(writeCoordinates.getX(), writeCoordinates.getY());
contentStream.setFont(writeConfig.getFont(), writeConfig.getFontSize());
contentStream.setNonStrokingColor(writeConfig.getFontColor());
contentStream.showText(writeConfig.getToWrite());
contentStream.endText();

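For company reasons I have left out the method signatures and catch blocks. I always close the content stream.

Most of the time the processed PDFs look fine in the Chrome PDF viewer, in Acrobat Reader, and after importing them into BMD. In some specific cases, however, I seem to have an encoding problem and some parts are displayed incorrectly. All text that I add to the PDF myself is always displayed correctly.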

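I noticed that only text printed in bold is displayed incorrectly, so I used Adobe Acrobat Reader to look at the fonts used in the PDF.

Arial and Arial,Bold are embedded and use Identity-H encoding. Since all of the affected text is bold, I concluded that all text written in Arial,Bold is displayed incorrectly. Everything else is still fine after processing the PDF. I cannot attach the PDF because it contains customer data, but here are some examples: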

Rechnungs-Nr: --> 5HFKQXQJV1U
60 Tage netto (27.12.2019) -> 7DJHQHWWR

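If the PDF is imported into BMD without having been manipulated by PDFBox, it is displayed correctly.

I tried to narrow the problem down by only scaling and by only writing, but the problem occurred both times.

I am using PDFBox 2.0.17 and Java 8.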

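Because the error also occurs when I only scale the PDF, I used PDFDebugger to compare the original PDF with the scaled PDF.

The only thing that looks different or off is the Contents entry.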

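When I open the scaled PDF, go to the font section and click on the "Arial,Bold" font, I get a lot of warnings about Unicode mappings, even though the PDF is displayed correctly in PDFDebugger.

I am neither an expert on PDFBox nor on fonts and encodings, so any help is highly appreciated!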

Answer 1

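In a nutshell

The relevant difference is that PDFBox serializes names differently. According to the PDF specification the different outputs are equivalent, though, so you have apparently discovered a WPViewPDF bug.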

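The difference in writing names

In the original PDF (raw.pdf) you find the names NOWFJV+Arial,Bold and NOWFJV+Arial,Bold-WinCharSetFFFF. In all the files manipulated by PDFBox, every occurrence of these names outside content streams has been replaced by NOWFJV+Arial#2CBold and NOWFJV+Arial#2CBold-WinCharSetFFFF.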
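To illustrate how a comma can turn into "#2C" when a name is written out, here is a minimal sketch of a conservative name serializer. It is not PDFBox's actual code: the class name, the method name, and the exact set of characters written literally are assumptions chosen for illustration. The idea is simply that every byte outside a small "safe" set is written as a number sign followed by its two-digit hexadecimal code.

```java
import java.nio.charset.StandardCharsets;

public class PdfNameEscaper {

    // Assumed set of punctuation written as-is; a real writer may use a different set.
    private static final String LITERAL_PUNCTUATION = "+-_@*$;.";

    // Returns the written form of a name (without the leading slash).
    public static String escape(String name) {
        StringBuilder sb = new StringBuilder();
        for (byte raw : name.getBytes(StandardCharsets.UTF_8)) {
            int b = raw & 0xFF;
            boolean literal = (b >= 'A' && b <= 'Z')
                    || (b >= 'a' && b <= 'z')
                    || (b >= '0' && b <= '9')
                    || LITERAL_PUNCTUATION.indexOf(b) >= 0;
            if (literal) {
                sb.append((char) b);
            } else {
                // Anything else, including the comma (0x2C), becomes #xx.
                sb.append('#').append(String.format("%02X", b));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("NOWFJV+Arial,Bold"));
        // prints NOWFJV+Arial#2CBold
    }
}
```

A serializer like this never has to decide whether a character is a delimiter in the PDF syntax, which is one plausible reason to escape more than strictly necessary.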

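WPViewPDF fails to correctly display text written in fonts with these changed names. After patching the PDFs to contain a comma again instead of the "#2C" in these names, WPViewPDF displays such text correctly again.

I assume WPViewPDF finds NOWFJV+Arial,Bold in the content stream and expects to find a matching font definition in the page resources under the identically written name, so it does not recognize the name NOWFJV+Arial#2CBold.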

Is this a PDFBox bug?

According to the PDF specification,

Any character in a name that is a regular character (other than NUMBER SIGN) shall be written as itself or by using its 2-digit hexadecimal code, preceded by the NUMBER SIGN.

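(ISO 32000-2, section 7.3.5 "Name Objects")

Thus, replacing the comma in a name with the sequence "#2C" is a completely valid alternative way of writing that name.

So no, this is not a PDFBox bug but apparently a WPViewPDF bug.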
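The equivalence of the two spellings can be illustrated with a small decoder that resolves "#xx" sequences before names are compared, which is what a viewer would need to do. This is a minimal sketch with hypothetical class and method names, not code taken from PDFBox or WPViewPDF, and it assumes the written name contains only single-byte characters.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class PdfNameDecoder {

    // Value of a hexadecimal digit, or -1 if the character is not one.
    private static int hexValue(char ch) {
        if (ch >= '0' && ch <= '9') return ch - '0';
        if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
        if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
        return -1;
    }

    // Decodes the written form of a name (without the leading slash).
    public static String decode(String written) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < written.length()) {
            char c = written.charAt(i);
            if (c == '#' && i + 2 < written.length()) {
                int hi = hexValue(written.charAt(i + 1));
                int lo = hexValue(written.charAt(i + 2));
                if (hi >= 0 && lo >= 0) {
                    out.write(hi * 16 + lo); // "#2C" becomes the single byte 0x2C, a comma
                    i += 3;
                    continue;
                }
            }
            out.write(c); // any other character is taken as-is
            i++;
        }
        return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "NOWFJV+Arial,Bold";
        String rewritten = "NOWFJV+Arial#2CBold";
        System.out.println(decode(rewritten));
        // prints NOWFJV+Arial,Bold
        System.out.println(decode(original).equals(decode(rewritten)));
        // prints true
    }
}
```

Comparing the decoded values rather than the raw spellings treats both forms as the same name, which is the behavior the quoted passage implies.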
