How does Apache Commons IO convert my XML header from UTF-8 to UTF-16?

Question

I'm using Java 6. I have an XML template that begins like this

<?xml version="1.0" encoding="UTF-8"?>

However, when I parse and output it with the following code (using Apache Commons IO 2.4), I notice that …

    Document doc = null;
    InputStream in = this.getClass().getClassLoader().getResourceAsStream("my-template.xml");

    try
    {
        byte[] data = org.apache.commons.io.IOUtils.toByteArray( in );
        InputSource src = new InputSource(new StringReader(new String(data)));

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        doc = builder.parse(src);
    }
    finally
    {
        in.close();
    }

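the first line is output as

<?xml version="1.0" encoding="UTF-16"?>

What do I need to do so that the header encoding stays "UTF-8" when parsing/outputting the file?

Edit: Following the suggestions given, I changed my code to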

    Document doc = null;
    InputStream in = this.getClass().getClassLoader().getResourceAsStream(name);

    try
    {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        doc = builder.parse(in);
    }
    finally
    {
        in.close();
    }

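But even though the first line of my input template file is

<?xml version="1.0" encoding="UTF-8"?>

when I output the document as a string, it produces

<?xml version="1.0" encoding="UTF-16"?>

as the first line. This is what I use to output the "doc" object as a string ...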

private String getDocumentString(Document doc)
{
    DOMImplementationLS domImplementation = (DOMImplementationLS)doc.getImplementation();
    LSSerializer lsSerializer = domImplementation.createLSSerializer();
    return lsSerializer.writeToString(doc);  
}

Answer 1

new StringReader(new String(data))

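This is wrong: new String(data) decodes the bytes with the platform default charset and ignores the encoding declared in the XML. You should let the parser detect the document encoding, for example by using DocumentBuilder.parse(InputStream):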

doc = builder.parse(in);
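To illustrate why handing the parser raw bytes matters, here is a minimal sketch. The class name ParseFromBytes and the ISO-8859-1 sample document are assumptions made up for this illustration; they are not part of the question. The bytes are deliberately not UTF-8, so decoding them by hand with the wrong charset would garble the text, while the parser reads the XML declaration and decodes them itself.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class, for illustration only.
class ParseFromBytes {

    // Parses raw bytes; the parser picks the charset from the XML declaration.
    static Document parse(byte[] xmlBytes) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        InputStream in = new ByteArrayInputStream(xmlBytes);
        try {
            return builder.parse(in);
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Sample document (an assumption): declared and encoded as ISO-8859-1.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><foo>\u00e9</foo>";
        byte[] latin1 = xml.getBytes("ISO-8859-1");

        Document doc = parse(latin1);
        // The text survives because the parser decoded the bytes, not our code.
        System.out.println(doc.getDocumentElement().getTextContent());
        // The encoding declared in the source document, as reported by the parser.
        System.out.println(doc.getXmlEncoding());
    }
}
```

The sketch avoids try-with-resources and StandardCharsets so that it stays compatible with the Java 6 setup mentioned in the question.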

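Which encoding the DOM is serialized with depends on how you write it out. An in-memory DOM has no concept of encoding.

Writing the document to a string with a UTF-8 declaration: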

import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.*;

public class DomIO {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                                             .newDocumentBuilder()
                                             .newDocument();
        doc.appendChild(doc.createElement("foo"));
        System.out.println(getDocumentString(doc));
    }

    public static String getDocumentString(Document doc) {
        DOMImplementationLS domImplementation = (DOMImplementationLS) 
                                                 doc.getImplementation();
        LSSerializer lsSerializer = domImplementation.createLSSerializer();
        LSOutput lsOut = domImplementation.createLSOutput();
        lsOut.setEncoding("UTF-8");
        lsOut.setCharacterStream(new StringWriter());
        lsSerializer.write(doc, lsOut);
        return lsOut.getCharacterStream().toString();
    }
}

LSOutput also has binary stream support, in case you want the serializer to encode the document correctly on output.
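A minimal sketch of the binary stream variant, assuming you want the encoded bytes rather than a String. The class name DomBytes and the method name toUtf8Bytes are made up for this illustration; setByteStream and setEncoding are the standard LSOutput methods.

```java
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical helper class, for illustration only.
class DomBytes {

    // Serializes the document to UTF-8 bytes; the serializer does the encoding.
    static byte[] toUtf8Bytes(Document doc) {
        DOMImplementationLS ls = (DOMImplementationLS) doc.getImplementation();
        LSSerializer serializer = ls.createLSSerializer();
        LSOutput out = ls.createLSOutput();
        out.setEncoding("UTF-8");
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        out.setByteStream(buffer);
        serializer.write(doc, out);
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                                             .newDocumentBuilder()
                                             .newDocument();
        doc.appendChild(doc.createElement("foo")).setTextContent("\u00e9");
        // Decode with the same charset the serializer was told to use.
        System.out.println(new String(toUtf8Bytes(doc), "UTF-8"));
    }
}
```

With a byte stream, the declared encoding and the actual bytes are produced by the same component, so they cannot drift apart. By contrast, LSSerializer.writeToString returns a DOMString, which the DOM Load and Save specification defines as UTF-16 encoded; that is consistent with the UTF-16 declaration seen in the question.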

Answer 2

As it turns out, when I changed my Document -> String method to

private String getDocumentString(Document doc)
{
    String ret = null;
    DOMSource domSource = new DOMSource(doc);
    StringWriter writer = new StringWriter();
    StreamResult result = new StreamResult(writer);
    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer transformer;
    try
    {
        transformer = tf.newTransformer();
        transformer.transform(domSource, result);
        ret = writer.toString();
    }
    catch (TransformerConfigurationException e)
    {
        e.printStackTrace();
    }
    catch (TransformerException e)
    {
        e.printStackTrace();
    }
    return ret;
}

the 'encoding="UTF-8"' headers were no longer output as 'encoding="UTF-16"'.
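The method above relies on the Transformer's default output encoding. If you would rather not depend on the default, a possible variant is to set the encoding explicitly through the standard OutputKeys properties. This is only a sketch; the class name TransformerOutput and the method name toXmlString are made up for this illustration.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical helper class, for illustration only.
class TransformerOutput {

    // Serializes the document to a String with an explicit UTF-8 declaration.
    static String toXmlString(Document doc) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Request the declared encoding instead of relying on the default.
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                                             .newDocumentBuilder()
                                             .newDocument();
        doc.appendChild(doc.createElement("foo"));
        System.out.println(toXmlString(doc));
    }
}
```

Note that the result is still a Java String; the declaration only becomes true once the String is written out as UTF-8 bytes, so whoever writes it to a file or socket has to use that charset.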

