解码 ISO-8859-1 编码 XML 文档中的 Unicode 字符

Question

使用 javax.xml.transform 我创建了这个 ISO-8859-1 文档，其中包含两个 &# 编码的字符 쎼 和 쎶:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xml>&#50108; and &#50102;</xml>

问题：符合标准的 XML reader 如何解释 쎼 和 쎶，

就像普通的 &# ... 字符串（不转换回 쎼 和 쎶）
作为쎼和쎶

生成 XML 的代码：

public void testInvalidCharacter() {
    try {
        String str = "\uC3BC and \uC3B6"; // 쎼 and 쎶
        System.out.println(str);

        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();
        Element root = doc.createElement("xml");
        root.setTextContent(str);
        doc.appendChild(root);

        DOMSource domSource = new DOMSource(doc);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, StandardCharsets.ISO_8859_1.name());

        StringWriter out = new StringWriter();
        transformer.transform(domSource, new StreamResult(out));

        System.out.println(out.toString());

    } catch (ParserConfigurationException | DOMException | IllegalArgumentException | TransformerException e) {
        e.printStackTrace(System.err);
    }
}

Answer 1

XML 解析器将识别 '&#...' 转义语法并正确地 return 쎼 和 쎶 及其 API 元素的文本。例如。在 Java 中，带有标签名称 'xml' 的元素的 org.w3c.dom.Element.getTextContent() 方法将 return 具有该 Unicode 字符的字符串，尽管您的 XML 文档本身就是 ISO-8859-1

解码 ISO-8859-1 编码 XML 文档中的 Unicode 字符

Decoding of Unicode characters in a ISO-8859-1 encoded XML document

java

xml

unicode

iso-8859-1