PDFBox 2.0: Overcoming dictionary key encoding
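I am using Apache PDFBox 2.0.1 to extract text from a PDF form, pulling out the details of its AcroForm fields. From a radio button field I dig up the appearance dictionary. I am interested in the /N and /D entries (normal and "down" appearance). Like this (interactive BeanShell):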
field = form.getField(fieldName);
widgets = field.getWidgets();
print("Field Name: " + field.getPartialName() + " (" + widgets.size() + ")");
for (annot : widgets) {
    ap = annot.getAppearance();
    keys = ap.getCOSObject().getDictionaryObject("N").keySet();
    keyList = new ArrayList(keys.size());
    for (cosKey : keys) { keyList.add(cosKey.getName()); }
    print(String.join("|", keyList));
}
The output is
Field Name: Krematorier (6)
Off|Skogskrem
Off|R�cksta
Off|Silverdal
Off|Stork�llan
Off|St Botvid
Nyn�shamn|Off
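The question mark blobs should be the Swedish characters "ä" or "å". Using iText RUPS I can see that the dictionary keys are encoded with ISO-8859-1 while PDFBox assumes they are Unicode, I guess.
Is there any way of decoding the keys using ISO-8859-1? Or any other way to retrieve the keys correctly?
This sample PDF form is available for download here: http://www.stockholm.se/PageFiles/85478/KYF%20211%20Best%C3%A4llning%202014.pdf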
Using iText RUPS I can see that the dictionary keys are encoded with ISO-8859-1 while PDFBox assumes they are Unicode, I guess.
Is there any way of decoding the keys using ISO-8859-1? Or any other way to retrieve the keys correctly?
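Changing the assumed encoding
PDFBox interprets the encoding of the bytes in names (only names can be used as dictionary keys in PDFs) in BaseParser.parseCOSName() when reading the name from the source PDF: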
/**
 * This will parse a PDF name from the stream.
 *
 * @return The parsed PDF name.
 * @throws IOException If there is an error reading from the stream.
 */
protected COSName parseCOSName() throws IOException
{
    readExpectedChar('/');
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    int c = seqSource.read();
    while (c != -1)
    {
        int ch = c;
        if (ch == '#')
        {
            int ch1 = seqSource.read();
            int ch2 = seqSource.read();
            if (isHexDigit((char)ch1) && isHexDigit((char)ch2))
            {
                String hex = "" + (char)ch1 + (char)ch2;
                try
                {
                    buffer.write(Integer.parseInt(hex, 16));
                }
                catch (NumberFormatException e)
                {
                    throw new IOException("Error: expected hex digit, actual='" + hex + "'", e);
                }
                c = seqSource.read();
            }
            else
            {
                // check for premature EOF
                if (ch2 == -1 || ch1 == -1)
                {
                    LOG.error("Premature EOF in BaseParser#parseCOSName");
                    c = -1;
                    break;
                }
                seqSource.unread(ch2);
                c = ch1;
                buffer.write(ch);
            }
        }
        else if (isEndOfName(ch))
        {
            break;
        }
        else
        {
            buffer.write(ch);
            c = seqSource.read();
        }
    }
    if (c != -1)
    {
        seqSource.unread(c);
    }
    String string = new String(buffer.toByteArray(), Charsets.UTF_8);
    return COSName.getPDFName(string);
}
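As you can see, after reading the name bytes and interpreting the # escape sequences, PDFBox unconditionally interprets the resulting bytes as UTF-8 encoded. To change this, you therefore have to patch this PDFBox class and replace the charset named at the bottom.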
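To illustrate what that last line does to a key like the ones in the question, here is a minimal standalone sketch using only the JDK. The class name NameDecodingDemo and the byte values are my assumptions: I take the key to be "Räcksta" stored as ISO-8859-1, i.e. with the single byte 0xE4 for "ä".

```java
import java.nio.charset.StandardCharsets;

public class NameDecodingDemo {
    // Assumed raw name bytes: "Räcksta" in ISO-8859-1, where 0xE4 is 'ä'.
    static final byte[] RAW = {'R', (byte) 0xE4, 'c', 'k', 's', 't', 'a'};

    // What the parser effectively does: decode the name bytes as UTF-8.
    static String asUtf8() {
        return new String(RAW, StandardCharsets.UTF_8);
    }

    // What a patched parser would do with ISO-8859-1 named instead.
    static String asLatin1() {
        return new String(RAW, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // 0xE4 is not valid UTF-8 here, so it decodes to U+FFFD.
        System.out.println(asUtf8());
        // Every byte maps to exactly one character in ISO-8859-1.
        System.out.println(asLatin1());
    }
}
```

Note that the UTF-8 result is lossy: once the name has become a String containing U+FFFD, the original byte cannot be recovered from it, which is why the charset has to be changed at parse time.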
Is PDFBox correct here?
According to the specification, when treating a name object as text,
the sequence of bytes (after expansion of NUMBER SIGN sequences, if any) should be interpreted according to UTF-8, a variable-length byte-encoded representation of Unicode in which the printable ASCII characters have the same representations as in ASCII.
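(section 7.3.5 Name Objects, ISO 32000-1)
BaseParser.parseCOSName() implements just that.
PDFBox's implementation is not completely correct, though, because interpreting the name as a string when there is no need to is already wrong: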
name objects shall be treated as atomic within a PDF file. Ordinarily, the bytes making up the name are never treated as text to be presented to a human user or to an application external to a conforming reader. However, occasionally the need arises to treat a name object as text
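Thus, a PDF library should handle names as byte arrays for as long as possible and only derive a string representation when one is explicitly required. Only then does the recommendation above (to assume UTF-8) come into play. The specification even points out where this may cause trouble: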
PDF does not prescribe what UTF-8 sequence to choose for representing any given piece of externally specified text as a name object. In some cases, multiple UTF-8 sequences may represent the same logical text. Name objects defined by different sequences of bytes constitute distinct name objects in PDF, even though the UTF-8 sequences may have identical external interpretations.
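Another case becomes apparent in the document at hand: if the sequence of bytes does not constitute valid UTF-8, it still is a valid name. But such names are changed by the method above, because any unparsable bytes or subsequences are replaced by the Unicode replacement character "�". Thus, different names may collapse into a single one.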
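The collapsing can be reproduced with two hypothetical names that differ only in one non-UTF-8 byte. This is merely a sketch: NameCollisionDemo and both byte arrays are made up for illustration and are not taken from the sample PDF.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class NameCollisionDemo {
    // Two distinct names as byte sequences: they differ in the second byte only.
    static final byte[] A = {'R', (byte) 0xE4, 'c', 'k', 's', 't', 'a'}; // 0xE4: 'ä' in ISO-8859-1
    static final byte[] B = {'R', (byte) 0xE5, 'c', 'k', 's', 't', 'a'}; // 0xE5: 'å' in ISO-8859-1

    // Same decoding step as at the end of the parser method quoted above.
    static String decode(byte[] nameBytes) {
        return new String(nameBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // As bytes the names are different, so they are distinct PDF names.
        System.out.println("bytes equal:   " + Arrays.equals(A, B));
        // After decoding, both invalid bytes became U+FFFD and the strings match.
        System.out.println("strings equal: " + decode(A).equals(decode(B)));
    }
}
```

In a dictionary keyed by such strings, the second entry would silently shadow or overwrite the first.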
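Another issue is that when writing a PDF back, PDFBox does not act symmetrically. Instead it encodes the String representation of the name (which was obtained by UTF-8 interpretation if the name was read from a PDF) using pure US_ASCII, cf. COSName.writePDF(OutputStream):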
public void writePDF(OutputStream output) throws IOException
{
    output.write('/');
    byte[] bytes = getName().getBytes(Charsets.US_ASCII);
    for (byte b : bytes)
    {
        int current = (b + 256) % 256;
        // be more restrictive than the PDF spec, "Name Objects", see PDFBOX-2073
        if (current >= 'A' && current <= 'Z' ||
            current >= 'a' && current <= 'z' ||
            current >= '0' && current <= '9' ||
            current == '+' ||
            current == '-' ||
            current == '_' ||
            current == '@' ||
            current == '*' ||
            current == '$' ||
            current == ';' ||
            current == '.')
        {
            output.write(current);
        }
        else
        {
            output.write('#');
            output.write(String.format("%02X", current).getBytes(Charsets.US_ASCII));
        }
    }
}
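Thus, any interesting Unicode characters are replaced with the US_ASCII default replacement character, which I assume is "?".
So it is very fortunate that PDF names usually contain only ASCII characters... ;)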
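The effect on writing can be tried out without PDFBox. The following is a simplified re-creation of the escaping logic above that returns a String instead of writing to a stream. The class AsciiWriteDemo and its escape method are my own names, not PDFBox API.

```java
import java.nio.charset.StandardCharsets;

public class AsciiWriteDemo {
    // Characters besides letters and digits that are written unescaped.
    private static final String PLAIN_EXTRA = "+-_@*$;.";

    // Simplified mirror of the writePDF logic shown above (hypothetical helper).
    static String escape(String name) {
        StringBuilder out = new StringBuilder("/");
        // getBytes replaces characters US_ASCII cannot encode with its replacement byte.
        for (byte b : name.getBytes(StandardCharsets.US_ASCII)) {
            int current = b & 0xFF;
            boolean plain = (current >= 'A' && current <= 'Z')
                    || (current >= 'a' && current <= 'z')
                    || (current >= '0' && current <= '9')
                    || PLAIN_EXTRA.indexOf(current) >= 0;
            if (plain) {
                out.append((char) current);
            } else {
                out.append(String.format("#%02X", current));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("St Botvid"));       // space is escaped
        System.out.println(escape("R\u00E4cksta"));    // 'ä' is lost before escaping
    }
}
```

On a standard JDK the replacement byte is indeed "?" (0x3F), so a correctly decoded "Räcksta" would be written as /R#3Fcksta, i.e. as a different name than the one that was read.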
Historically
According to the implementation notes in the PDF 1.4 reference,
In Acrobat 4.0 and earlier versions, a name object being treated as text will typically be interpreted in a host platform encoding, which depends on the operating system and the local language. For Asian languages, this encoding may be something like Shift-JIS or Big Five. Consequently, it will be necessary to distinguish between names encoded this way and ones encoded as UTF-8. Fortunately, UTF-8 encoding is very stylized and its use can usually be recognized. A name that is found not to conform to UTF-8 encoding rules can instead be interpreted according to host platform encoding.
So the sample document at hand seems to follow the conventions of Acrobat 4, i.e. of the last century.
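The heuristic from that implementation note (try UTF-8 first, fall back to a platform encoding if the bytes do not conform) could be sketched roughly as follows. This is not PDFBox code. NameTextFallback and toText are hypothetical names, and the caller has to guess the fallback charset; for the Swedish form ISO-8859-1 looks plausible.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class NameTextFallback {
    // Interpret name bytes as UTF-8 if they are valid UTF-8, else use the fallback charset.
    static String toText(byte[] nameBytes, Charset fallback) {
        try {
            // REPORT makes the decoder throw instead of inserting U+FFFD.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(nameBytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: treat as the assumed "host platform encoding".
            return new String(nameBytes, fallback);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = {'N', 'y', 'n', (byte) 0xE4, 's', 'h', 'a', 'm', 'n'};
        byte[] utf8 = "Nyn\u00E4shamn".getBytes(StandardCharsets.UTF_8);
        System.out.println(toText(latin1, StandardCharsets.ISO_8859_1));
        System.out.println(toText(utf8, StandardCharsets.ISO_8859_1));
    }
}
```

Such a method only helps if it is applied to the raw name bytes, so in PDFBox 2.0.x it would still require patching the parser as described above.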
The source code is taken from PDFBox 2.0.0, but at first glance it does not seem to have changed in 2.0.1 or the development trunk.