PDFBox 2.0: Overcoming dictionary key encoding

Question

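I am using Apache PDFBox 2.0.1 to extract text from a PDF form, including the details of the AcroForm fields. From a radio button field I dig up the appearance dictionary. I am interested in the /N and /D entries (the normal and "down" appearances). Like this (interactive BeanShell):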

field = form.getField(fieldName);
widgets = field.getWidgets();
print("Field Name: " + field.getPartialName() + " (" + widgets.size() + ")");
for (annot : widgets) {
  ap = annot.getAppearance();
  keys = ap.getCOSObject().getDictionaryObject("N").keySet();
  keyList = new ArrayList(keys.size());
  for (cosKey : keys) {keyList.add(cosKey.getName());}
  print(String.join("|", keyList));
}

The output is:

Field Name: Krematorier (6)
Off|Skogskrem
Off|R�cksta
Off|Silverdal
Off|Stork�llan
Off|St Botvid
Nyn�shamn|Off

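The question-mark blobs should be the Swedish characters "ä" or "å". Using iText RUPS I can see that the dictionary keys are encoded with ISO-8859-1, while PDFBox assumes they are Unicode, I guess.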

Is there any way of decoding the keys using ISO-8859-1? Or any other way to retrieve the keys correctly?

The sample PDF form can be downloaded here: http://www.stockholm.se/PageFiles/85478/KYF%20211%20Best%C3%A4llning%202014.pdf

Answer 1

Using iText RUPS I can see that the dictionary keys are encoded with ISO-8859-1 while PDFBox assumes they are Unicode, I guess.

Is there any way of decoding the keys using ISO-8859-1? Or any other way to retrieve the keys correctly?

Changing the assumed encoding

PDFBox interprets the byte encoding of names (only names can be used as dictionary keys in PDF) in BaseParser.parseCOSName(), at the moment the name is read from the source PDF:

/**
 * This will parse a PDF name from the stream.
 *
 * @return The parsed PDF name.
 * @throws IOException If there is an error reading from the stream.
 */
protected COSName parseCOSName() throws IOException
{
    readExpectedChar('/');
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    int c = seqSource.read();
    while (c != -1)
    {
        int ch = c;
        if (ch == '#')
        {
            int ch1 = seqSource.read();
            int ch2 = seqSource.read();
            if (isHexDigit((char)ch1) && isHexDigit((char)ch2))
            {
                String hex = "" + (char)ch1 + (char)ch2;
                try
                {
                    buffer.write(Integer.parseInt(hex, 16));
                }
                catch (NumberFormatException e)
                {
                    throw new IOException("Error: expected hex digit, actual='" + hex + "'", e);
                }
                c = seqSource.read();
            }
            else
            {
                // check for premature EOF
                if (ch2 == -1 || ch1 == -1)
                {
                    LOG.error("Premature EOF in BaseParser#parseCOSName");
                    c = -1;
                    break;
                }
                seqSource.unread(ch2);
                c = ch1;
                buffer.write(ch);
            }
        }
        else if (isEndOfName(ch))
        {
            break;
        }
        else
        {
            buffer.write(ch);
            c = seqSource.read();
        }
    }
    if (c != -1)
    {
        seqSource.unread(c);
    }
    String string = new String(buffer.toByteArray(), Charsets.UTF_8);
    return COSName.getPDFName(string);
}

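As you can see, after reading the name bytes and resolving the # escape sequences, PDFBox unconditionally interprets the resulting bytes as UTF-8. To change this, you therefore have to patch this PDFBox class and replace the charset named at the bottom.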
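To make the effect of that last decoding step concrete, here is a minimal sketch that decodes the same bytes once as UTF-8 and once as ISO-8859-1. The byte array is an assumption: it is what the name "Råcksta" would look like if the sample PDF stores it in ISO-8859-1, as RUPS suggests. The class and method names are made up for the illustration and are not part of PDFBox.

```java
import java.nio.charset.StandardCharsets;

class NameDecodingDemo {
    // Assumed raw name bytes: "Råcksta" in ISO-8859-1 (å = 0xE5).
    static final byte[] RAW = {'R', (byte) 0xE5, 'c', 'k', 's', 't', 'a'};

    // Mirrors the charset choice at the end of parseCOSName().
    static String asUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // What a patched parser would do if the charset were swapped.
    static String asLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // 0xE5 is not valid UTF-8 here, so it becomes U+FFFD.
        System.out.println(asUtf8(RAW).equals("R\uFFFDcksta"));
        // ISO-8859-1 maps every byte to a character, so nothing is lost.
        System.out.println(asLatin1(RAW).equals("R\u00E5cksta"));
    }
}
```

Note that swapping the charset globally would in turn misread names that really are UTF-8 encoded, so such a patch only suits documents known to use ISO-8859-1.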

Is PDFBox correct here?

According to the specification, when a name object is treated as text,

the sequence of bytes (after expansion of NUMBER SIGN sequences, if any) should be interpreted according to UTF-8, a variable-length byte-encoded representation of Unicode in which the printable ASCII characters have the same representations as in ASCII.

(section 7.3.5 Name Objects, ISO 32000-1)

BaseParser.parseCOSName() implements just that.

PDFBox's implementation is not completely correct, though, because interpreting a name as a string when there is no need to is already wrong:

name objects shall be treated as atomic within a PDF file. Ordinarily, the bytes making up the name are never treated as text to be presented to a human user or to an application external to a conforming reader. However, occasionally the need arises to treat a name object as text

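A PDF library should therefore handle names as byte arrays for as long as possible and derive a string representation only when one is explicitly required; only then does the recommendation above (assume UTF-8) come into play. The specification even points out where this can cause trouble: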

PDF does not prescribe what UTF-8 sequence to choose for representing any given piece of externally specified text as a name object. In some cases, multiple UTF-8 sequences may represent the same logical text. Name objects defined by different sequences of bytes constitute distinct name objects in PDF, even though the UTF-8 sequences may have identical external interpretations.

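Another situation becomes apparent in the document at hand: a byte sequence that does not constitute valid UTF-8 is still a valid name. The method above changes such names, though: every unparsable byte or subsequence is replaced by the Unicode replacement character "�" (U+FFFD). Different names may thus collapse into a single one.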
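The collapse of distinct names can be reproduced outside PDFBox. The sketch below uses two made-up three-byte names that differ only in one non-UTF-8 byte; the class name and the sample bytes are assumptions chosen for the illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class NameCollisionDemo {
    // Two distinct names at the byte level: "Näs" and "Nås" in ISO-8859-1.
    static final byte[] A = {'N', (byte) 0xE4, 's'};
    static final byte[] B = {'N', (byte) 0xE5, 's'};

    // Same lenient UTF-8 decoding that parseCOSName() applies.
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The byte sequences differ, so these are different PDF names.
        System.out.println(Arrays.equals(A, B));
        // After decoding, both are 'N', U+FFFD, 's' and can no longer be told apart.
        System.out.println(decode(A).equals(decode(B)));
    }
}
```

Comparing the raw bytes keeps the two names distinct, which is why a byte-based key would be the safer internal representation.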

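Another issue is that PDFBox does not act symmetrically when writing a PDF back out. Instead, it encodes the String representation of the name (which was obtained by UTF-8 interpretation if the name was read from a PDF) using plain US_ASCII, cf. COSName.writePDF(OutputStream):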

public void writePDF(OutputStream output) throws IOException
{
    output.write('/');
    byte[] bytes = getName().getBytes(Charsets.US_ASCII);
    for (byte b : bytes)
    {
        int current = (b + 256) % 256;

        // be more restrictive than the PDF spec, "Name Objects", see PDFBOX-2073
        if (current >= 'A' && current <= 'Z' ||
                current >= 'a' && current <= 'z' ||
                current >= '0' && current <= '9' ||
                current == '+' ||
                current == '-' ||
                current == '_' ||
                current == '@' ||
                current == '*' ||
                current == '$' ||
                current == ';' ||
                current == '.')
        {
            output.write(current);
        }
        else
        {
            output.write('#');
            output.write(String.format("%02X", current).getBytes(Charsets.US_ASCII));
        }
    }
}

Thus, any interesting Unicode character is replaced by the US_ASCII default replacement character, which I assume to be "?".

So it is fortunate that PDF names usually contain only ASCII characters... ;)
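The write side can be checked in isolation as well. The following is a standalone re-creation of the escaping loop quoted above, returning a String instead of writing to a stream; it is a sketch for experimenting, not PDFBox code, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class NameWriteDemo {
    // Re-creation of the loop in COSName.writePDF(), for illustration only.
    static String escape(String name) {
        StringBuilder out = new StringBuilder("/");
        // Characters outside US-ASCII are replaced with '?' by getBytes().
        for (byte b : name.getBytes(StandardCharsets.US_ASCII)) {
            int current = b & 0xFF;
            boolean plain = current >= 'A' && current <= 'Z'
                    || current >= 'a' && current <= 'z'
                    || current >= '0' && current <= '9'
                    || "+-_@*$;.".indexOf(current) >= 0;
            if (plain) {
                out.append((char) current);
            } else {
                // Everything else is written as a # escape, including the '?'.
                out.append('#').append(String.format("%02X", current));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("R\u00E5cksta"));
    }
}
```

With Java's standard behaviour for String.getBytes(Charset), the "å" turns into "?" (0x3F), which the loop then writes as #3F, so the original character is gone on output.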

Historically

According to the implementation notes in the PDF 1.4 reference,

In Acrobat 4.0 and earlier versions, a name object being treated as text will typically be interpreted in a host platform encoding, which depends on the operating system and the local language. For Asian languages, this encoding may be something like Shift-JIS or Big Five. Consequently, it will be necessary to distinguish between names encoded this way and ones encoded as UTF-8. Fortunately, UTF-8 encoding is very stylized and its use can usually be recognized. A name that is found not to conform to UTF-8 encoding rules can instead be interpreted according to host platform encoding.

So the sample document at hand seems to follow the conventions of Acrobat 4, i.e. conventions of the last century.
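The heuristic described in that implementation note (recognize UTF-8, otherwise fall back to a platform encoding) can be sketched with a strict decoder. This is only an illustration of the idea: the class name is hypothetical, ISO-8859-1 is assumed as the fallback because it fits the sample document, and PDFBox itself does not do this.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class NameFallbackDecoder {
    static String decode(byte[] raw) {
        try {
            // Strict UTF-8: report malformed input instead of substituting U+FFFD.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: fall back to the assumed legacy encoding.
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = {'R', (byte) 0xE5, 'c', 'k', 's', 't', 'a'};
        byte[] utf8 = "R\u00E5cksta".getBytes(StandardCharsets.UTF_8);
        // Both byte sequences should come out as the same readable text.
        System.out.println(decode(latin1).equals(decode(utf8)));
    }
}
```

Such a fallback needs access to the raw name bytes, so it would have to live in the parser; it cannot be applied to the String that PDFBox hands out afterwards, because the original byte is already replaced by then.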

(The source code was taken from PDFBox 2.0.0, but at first glance it does not appear to have changed in 2.0.1 or in the development trunk.)
