PDFBox 2.0: Overcoming dictionary key encoding

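I am using Apache PDFBox 2.0.1 to extract text from PDF forms, including the details of AcroForm fields. From a radio button field I dig out the appearance dictionary. I am interested in the /N and /D entries (the normal and "down" appearances). Like this (interactive Bean shell):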

field = form.getField(fieldName);
widgets = field.getWidgets();
print("Field Name: " + field.getPartialName() + " (" + widgets.size() + ")");
for (annot : widgets) {
  ap = annot.getAppearance();
  keys = ap.getCOSObject().getDictionaryObject("N").keySet();
  keyList = new ArrayList(keys.size());
  for (cosKey : keys) {keyList.add(cosKey.getName());}
  print(String.join("|", keyList));
}

The output is

Field Name: Krematorier (6)
Off|Skogskrem
Off|R�cksta
Off|Silverdal
Off|Stork�llan
Off|St Botvid
Nyn�shamn|Off

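The question mark blobs should be the Swedish characters "ä" or "å". Using iText RUPS I can see that the dictionary keys are encoded with ISO-8859-1, while PDFBox assumes they are Unicode, I guess.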

Is there any way of decoding the keys using ISO-8859-1? Or any other way to retrieve the keys correctly?

The sample PDF form can be downloaded here: http://www.stockholm.se/PageFiles/85478/KYF%20211%20Best%C3%A4llning%202014.pdf

Changing the assumed encoding

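PDFBox interprets the encoding of the bytes in names (only names can be used as dictionary keys in PDF) in BaseParser.parseCOSName(), at the moment the name is read from the source PDF: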

/**
 * This will parse a PDF name from the stream.
 *
 * @return The parsed PDF name.
 * @throws IOException If there is an error reading from the stream.
 */
protected COSName parseCOSName() throws IOException
{
    readExpectedChar('/');
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    int c = seqSource.read();
    while (c != -1)
    {
        int ch = c;
        if (ch == '#')
        {
            int ch1 = seqSource.read();
            int ch2 = seqSource.read();
            if (isHexDigit((char)ch1) && isHexDigit((char)ch2))
            {
                String hex = "" + (char)ch1 + (char)ch2;
                try
                {
                    buffer.write(Integer.parseInt(hex, 16));
                }
                catch (NumberFormatException e)
                {
                    throw new IOException("Error: expected hex digit, actual='" + hex + "'", e);
                }
                c = seqSource.read();
            }
            else
            {
                // check for premature EOF
                if (ch2 == -1 || ch1 == -1)
                {
                    LOG.error("Premature EOF in BaseParser#parseCOSName");
                    c = -1;
                    break;
                }
                seqSource.unread(ch2);
                c = ch1;
                buffer.write(ch);
            }
        }
        else if (isEndOfName(ch))
        {
            break;
        }
        else
        {
            buffer.write(ch);
            c = seqSource.read();
        }
    }
    if (c != -1)
    {
        seqSource.unread(c);
    }
    String string = new String(buffer.toByteArray(), Charsets.UTF_8);
    return COSName.getPDFName(string);
}

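As you can see, after reading the name bytes and interpreting the # escape sequences, PDFBox unconditionally interprets the resulting bytes as UTF-8 encoded. To change this, you have to patch this PDFBox class and replace the charset named at the bottom of the method.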
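To illustrate what the charset choice in that last step does to one of the keys from the question, here is a minimal sketch that does not use PDFBox at all. The class name is hypothetical, and the byte values are an assumption: they are taken to be the ISO-8859-1 bytes of "Räcksta", as the RUPS observation suggests.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of PDFBox.
class NameCharsetDemo {
    // Assumed raw name bytes after '#' expansion: "Räcksta" in ISO-8859-1 (0xE4 = 'ä').
    static final byte[] NAME_BYTES = {'R', (byte) 0xE4, 'c', 'k', 's', 't', 'a'};

    // Mirrors the final step of parseCOSName(), with the charset made a parameter.
    static String decode(byte[] nameBytes, Charset charset) {
        return new String(nameBytes, charset);
    }

    public static void main(String[] args) {
        // 0xE4 followed by 'c' is not valid UTF-8, so the byte becomes U+FFFD.
        System.out.println(decode(NAME_BYTES, StandardCharsets.UTF_8));
        // Every byte maps to exactly one character in ISO-8859-1, so 'ä' survives.
        System.out.println(decode(NAME_BYTES, StandardCharsets.ISO_8859_1));
    }
}
```

With UTF-8 the second character comes out as the replacement character U+FFFD, which matches the output in the question; with ISO-8859-1 the string is "Räcksta". How either one is displayed depends on the console encoding.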

Is PDFBox correct here?

According to the specification, when treating a name object as text

the sequence of bytes (after expansion of NUMBER SIGN sequences, if any) should be interpreted according to UTF-8, a variable-length byte-encoded representation of Unicode in which the printable ASCII characters have the same representations as in ASCII.

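(section 7.3.5 Name Objects, ISO 32000-1)

BaseParser.parseCOSName() implements just that.

PDFBox's implementation is not completely correct, though, because interpreting the name as a string when there is no need to is already wrong: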

name objects shall be treated as atomic within a PDF file. Ordinarily, the bytes making up the name are never treated as text to be presented to a human user or to an application external to a conforming reader. However, occasionally the need arises to treat a name object as text

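A PDF library should therefore handle names as byte arrays for as long as possible and only derive a string representation when one is explicitly required. Only then does the recommendation above (assume UTF-8) come into play. The specification even points out where this can cause trouble: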

PDF does not prescribe what UTF-8 sequence to choose for representing any given piece of externally specified text as a name object. In some cases, multiple UTF-8 sequences may represent the same logical text. Name objects defined by different sequences of bytes constitute distinct name objects in PDF, even though the UTF-8 sequences may have identical external interpretations.

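Another such situation becomes apparent in the document at hand: if the sequence of bytes does not constitute valid UTF-8, it is still a valid name. Such names are changed by the method above, though, because every byte or subsequence that cannot be parsed is replaced by the Unicode replacement character "�". As a result, different names may collapse into a single one.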
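The collapsing of names can be reproduced with plain Java string decoding. The following is a small sketch under assumptions: the class name is hypothetical, and the two byte sequences are made-up names that differ only in one non-UTF-8 byte (0xE4 for "ä" and 0xE5 for "å" in ISO-8859-1).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of PDFBox.
class NameCollapseDemo {
    // Same decoding step as at the end of parseCOSName().
    static String asUtf8Text(byte[] nameBytes) {
        return new String(nameBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two distinct names on the byte level ("Räcksta" and "Råcksta" in ISO-8859-1).
        byte[] first = {'R', (byte) 0xE4, 'c', 'k', 's', 't', 'a'};
        byte[] second = {'R', (byte) 0xE5, 'c', 'k', 's', 't', 'a'};

        System.out.println(Arrays.equals(first, second));                 // false
        // Both invalid bytes turn into U+FFFD, so the strings are identical.
        System.out.println(asUtf8Text(first).equals(asUtf8Text(second))); // true
    }
}
```

Two keys that are distinct in the PDF would therefore end up as the same string after parsing.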

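Another issue is that PDFBox does not act symmetrically when writing a PDF back. Instead, it encodes the String representation of the name (which, if the name was read from a PDF, is the result of the UTF-8 interpretation) using plain US_ASCII, cf. COSName.writePDF(OutputStream):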

public void writePDF(OutputStream output) throws IOException
{
    output.write('/');
    byte[] bytes = getName().getBytes(Charsets.US_ASCII);
    for (byte b : bytes)
    {
        int current = (b + 256) % 256;

        // be more restrictive than the PDF spec, "Name Objects", see PDFBOX-2073
        if (current >= 'A' && current <= 'Z' ||
                current >= 'a' && current <= 'z' ||
                current >= '0' && current <= '9' ||
                current == '+' ||
                current == '-' ||
                current == '_' ||
                current == '@' ||
                current == '*' ||
                current == '$' ||
                current == ';' ||
                current == '.')
        {
            output.write(current);
        }
        else
        {
            output.write('#');
            output.write(String.format("%02X", current).getBytes(Charsets.US_ASCII));
        }
    }
}

Thus, any interesting Unicode character is replaced by the US_ASCII default replacement character, which I assume to be "?".
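The effect on the written name can be checked with a simplified re-creation of that loop. This is only a sketch: the class and method names are hypothetical, and it builds a String instead of writing to an OutputStream. It relies on the documented behaviour of String.getBytes(Charset), which substitutes the charset's default replacement bytes for unmappable characters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; a simplified re-creation of the escaping in COSName.writePDF().
class NameWriteDemo {
    // Characters that are written unescaped, as in the excerpt above.
    static boolean isPlain(int c) {
        return c >= 'A' && c <= 'Z'
                || c >= 'a' && c <= 'z'
                || c >= '0' && c <= '9'
                || "+-_@*$;.".indexOf(c) >= 0;
    }

    static String write(String name) {
        StringBuilder out = new StringBuilder("/");
        // Non-ASCII characters are already lost here: getBytes() substitutes '?'.
        for (byte b : name.getBytes(StandardCharsets.US_ASCII)) {
            int current = b & 0xFF;
            if (isPlain(current)) {
                out.append((char) current);
            } else {
                out.append('#').append(String.format("%02X", current));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(write("St Botvid"));     // /St#20Botvid
        System.out.println(write("R\u00E4cksta"));  // /R#3Fcksta
    }
}
```

In this sketch the "ä" ends up as #3F, the escaped form of "?", so the original byte cannot be recovered from the written file.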

So it is fortunate that PDF names usually contain only ASCII characters... ;)

Historically

According to the implementation notes in the PDF Reference 1.4,

In Acrobat 4.0 and earlier versions, a name object being treated as text will typically be interpreted in a host platform encoding, which depends on the operating system and the local language. For Asian languages, this encoding may be something like Shift-JIS or Big Five. Consequently, it will be necessary to distinguish between names encoded this way and ones encoded as UTF-8. Fortunately, UTF-8 encoding is very stylized and its use can usually be recognized. A name that is found not to conform to UTF-8 encoding rules can instead be interpreted according to host platform encoding.

Thus, the sample document at hand appears to follow the conventions of Acrobat 4, i.e. conventions from the last century.
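The fallback described in that implementation note could be sketched as follows. This is not something PDFBox does; the class and method names are hypothetical, and using ISO-8859-1 as the "host platform encoding" is an assumption that happens to fit this document.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
class NameFallbackDemo {
    // Decodes strictly as UTF-8 first; falls back to the given charset on invalid input.
    static String decode(byte[] nameBytes, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(nameBytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: interpret the bytes in the assumed legacy encoding.
            return new String(nameBytes, fallback);
        }
    }

    public static void main(String[] args) {
        byte[] legacy = {'R', (byte) 0xE4, 'c', 'k', 's', 't', 'a'};       // ISO-8859-1 bytes
        byte[] utf8 = "R\u00E4cksta".getBytes(StandardCharsets.UTF_8);      // UTF-8 bytes

        // Both calls yield the same text, "Räcksta".
        System.out.println(decode(legacy, StandardCharsets.ISO_8859_1));
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1));
    }
}
```

Such a heuristic only helps if it is applied to the raw name bytes, i.e. before the parser has already replaced the invalid bytes, so it would still require patching parseCOSName().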

The source code excerpts are from PDFBox 2.0.0, but at first glance nothing seems to have changed in 2.0.1 or the development trunk.