How does Apache Commons IO convert my XML header from UTF-8 to UTF-16?
I'm using Java 6. I have an XML template that begins like this:
<?xml version="1.0" encoding="UTF-8"?>
However, when I parse and output it with the following code (using Apache Commons IO 2.4), I notice that…
    Document doc = null;
    InputStream in = this.getClass().getClassLoader().getResourceAsStream("my-template.xml");
    try
    {
        byte[] data = org.apache.commons.io.IOUtils.toByteArray( in );
        InputSource src = new InputSource(new StringReader(new String(data)));
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        doc = builder.parse(src);
    }
    finally
    {
        in.close();
    }
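the first line is output as
<?xml version="1.0" encoding="UTF-16"?>
What do I need to do when parsing/outputting the file so that the header encoding stays "UTF-8"?
Edit: Following the suggestion given, I changed my code to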
    Document doc = null;
    InputStream in = this.getClass().getClassLoader().getResourceAsStream(name);
    try
    {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        doc = builder.parse(in);
    }
    finally
    {
        in.close();
    }
But even though the first line of my input template file is
<?xml version="1.0" encoding="UTF-8"?>
when I output the document as a string, it produces
<?xml version="1.0" encoding="UTF-16"?>
as the first line. This is what I use to output the "doc" object as a string ...
    private String getDocumentString(Document doc)
    {
        DOMImplementationLS domImplementation = (DOMImplementationLS)doc.getImplementation();
        LSSerializer lsSerializer = domImplementation.createLSSerializer();
        return lsSerializer.writeToString(doc);
    }
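    new StringReader(new String(data))
This is wrong: new String(data) decodes the bytes with the platform default charset, so the parser only ever sees characters and cannot apply the encoding the document declares. You should let the parser detect the document encoding itself, using (for example) DocumentBuilder.parse(InputStream):
    doc = builder.parse(in);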
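To illustrate the point about letting the parser detect the encoding, here is a minimal sketch. The class name ParseFromBytes, the helper parse, and the ISO-8859-1 sample document are made up for this example; they are not from the question. It assumes the JDK's built-in parser, which supports ISO-8859-1.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class ParseFromBytes {
    // Hypothetical helper: hand the parser raw bytes so that it reads the
    // XML declaration and picks the decoder itself.
    static Document parse(byte[] xmlBytes) throws Exception {
        InputStream in = new ByteArrayInputStream(xmlBytes);
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(in);
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Sample document (an assumption for this sketch): declared and
        // encoded as ISO-8859-1, with one non-ASCII character in the text.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><foo>caf\u00e9</foo>";
        Document doc = parse(xml.getBytes("ISO-8859-1"));

        // The text should come back intact whatever the platform default
        // charset is, because the parser decoded the bytes itself.
        String text = doc.getDocumentElement().getTextContent();
        System.out.println("caf\u00e9".equals(text));
    }
}
```

Going through new String(data) instead would decode those same bytes with the platform default charset before the parser ever sees them, which is where non-ASCII content can get corrupted.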
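Which encoding the DOM is serialized to depends on how you write it; the in-memory DOM has no concept of encoding. LSSerializer.writeToString returns a DOMString, which the DOM Load and Save specification defines as UTF-16, hence the UTF-16 declaration.
Writing the document to a string with a UTF-8 declaration: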
    import java.io.StringWriter;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.ls.*;

    public class DomIO {
        public static void main(String[] args) throws Exception {
            // Build a tiny in-memory document to serialize.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            doc.appendChild(doc.createElement("foo"));
            System.out.println(getDocumentString(doc));
        }

        public static String getDocumentString(Document doc) {
            DOMImplementationLS domImplementation = (DOMImplementationLS)
                    doc.getImplementation();
            LSSerializer lsSerializer = domImplementation.createLSSerializer();
            LSOutput lsOut = domImplementation.createLSOutput();
            // The encoding set here is the one written into the XML declaration.
            lsOut.setEncoding("UTF-8");
            lsOut.setCharacterStream(new StringWriter());
            lsSerializer.write(doc, lsOut);
            return lsOut.getCharacterStream().toString();
        }
    }
LSOutput also has binary stream support, if you want the serializer to correctly encode the document on output.
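The binary stream variant mentioned above could look roughly like the sketch below. The class name DomBytes and the helper toUtf8Bytes are invented for illustration; the calls used are LSOutput.setEncoding and LSOutput.setByteStream from org.w3c.dom.ls. It assumes, as the code above does, that the document's implementation supports the cast to DOMImplementationLS.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

class DomBytes {
    // Hypothetical helper: serialize the document to UTF-8 bytes, so the
    // declaration and the actual byte encoding agree.
    static byte[] toUtf8Bytes(Document doc) {
        DOMImplementationLS ls = (DOMImplementationLS) doc.getImplementation();
        LSSerializer serializer = ls.createLSSerializer();
        LSOutput out = ls.createLSOutput();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        out.setEncoding("UTF-8");   // encoding for the declaration and the bytes
        out.setByteStream(buffer);  // byte stream instead of a character stream
        serializer.write(doc, out);
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        doc.appendChild(doc.createElement("foo"));
        doc.getDocumentElement().setTextContent("caf\u00e9");

        byte[] bytes = toUtf8Bytes(doc);
        // Decode only for display; the bytes are what would go to a file or socket.
        System.out.println(new String(bytes, Charset.forName("UTF-8")));
    }
}
```

With a byte stream the serializer does the encoding itself, so there is no intermediate Java string whose declaration could disagree with the bytes that are eventually written.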
It turns out that when I changed my Document -> String method to
    private String getDocumentString(Document doc)
    {
        String ret = null;
        DOMSource domSource = new DOMSource(doc);
        StringWriter writer = new StringWriter();
        StreamResult result = new StreamResult(writer);
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer transformer;
        try
        {
            transformer = tf.newTransformer();
            transformer.transform(domSource, result);
            ret = writer.toString();
        }
        catch (TransformerConfigurationException e)
        {
            e.printStackTrace();
        }
        catch (TransformerException e)
        {
            e.printStackTrace();
        }
        return ret;
    }
the 'encoding="UTF-8"' headers were no longer output as 'encoding="UTF-16"'.
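The Transformer version presumably works because the identity transformer falls back to UTF-8 when no output encoding is set. A variant that does not rely on that default might set the encoding explicitly, as in this sketch. The class name DomToString and the helper toXmlString are hypothetical; OutputKeys.ENCODING and OutputKeys.OMIT_XML_DECLARATION are standard javax.xml.transform output properties.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

class DomToString {
    // Hypothetical helper: serialize with an identity transform and state
    // the declared encoding explicitly instead of relying on the default.
    static String toXmlString(Document doc) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        doc.appendChild(doc.createElement("foo"));
        System.out.println(toXmlString(doc));
    }
}
```

Letting the exception propagate, rather than printing the stack trace and returning null as in the method above, is a design choice of this sketch: the caller then cannot mistake a failed serialization for an empty result.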