Xalan Transformer transform breaks Unicode 6 astral characters
Character U+1F48B was introduced in Unicode 6.0.
Unicode 6.0 support was introduced in Java 7.
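For context, U+1F48B lies outside the Basic Multilingual Plane, so Java represents it as a UTF-16 surrogate pair. The sketch below shows that decomposition with standard `java.lang.Character` methods; the class and method names (`SurrogatePairDemo`, `utf16Units`) are made up for illustration.

```java
class SurrogatePairDemo {
    /** Returns the UTF-16 code units of a code point as plain int values. */
    static int[] utf16Units(int codePoint) {
        char[] units = Character.toChars(codePoint);
        int[] result = new int[units.length];
        for (int i = 0; i < units.length; i++) {
            result[i] = units[i];
        }
        return result;
    }

    public static void main(String[] args) {
        int codePoint = 0x1F48B;
        int[] units = utf16Units(codePoint);
        // Supplementary (astral) code points need two UTF-16 code units.
        System.out.println(Character.isSupplementaryCodePoint(codePoint)); // prints true
        // The high and low surrogate, in decimal.
        System.out.println(units[0] + " " + units[1]); // prints 55357 56459
        // Recombining the pair gives back the original code point.
        System.out.println(Character.toCodePoint((char) units[0], (char) units[1]) == codePoint); // prints true
    }
}
```

The two decimal values are the same numbers used to build the test string in the code further down.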
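I can't get the Xalan 2.7.2 serializer to write that character correctly; instead, it writes the two surrogates as separate numeric character references: &#55357;&#56459;
Downstream, things then go badly: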
org.xml.sax.SAXParseException; Character reference "&#55357" is an invalid XML character.
at org.apache.xerces.parsers.AbstractSAXParser.parse
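The parser failure above can probably be reproduced without Xalan at all. The sketch below assumes the JDK's default JAXP parser and feeds it two hand-written documents: one with the pair of surrogate references, and one with a single reference to the real code point (128139 is 0x1F48B in decimal). The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class SurrogateReferenceParseDemo {
    /** Returns the root element's text, or null if the parser rejects the document. */
    static String parseRootText(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors instead of printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (SAXParseException e) {
            return null; // not well-formed, e.g. an invalid character reference
        }
    }

    public static void main(String[] args) throws Exception {
        // Surrogate values are not legal XML characters, so each reference is invalid on its own.
        System.out.println(parseRootText("<foo>&#55357;&#56459;</foo>") == null); // prints true
        // A single reference to the supplementary code point is fine.
        String text = parseRootText("<foo>&#128139;</foo>");
        System.out.println(Integer.toHexString(text.codePointAt(0))); // prints 1f48b
    }
}
```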
By contrast, Saxon 8.7 serializes it correctly.
Does anyone know how to get Xalan to write it correctly?
Here is code that demonstrates the problem:
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class SurrogatePairSerialisation {

    public static String TRANSFORMER_FACTORY_PROCESSOR_XALAN = "org.apache.xalan.processor.TransformerFactoryImpl";
    public static String TRANSFORMER_FACTORY_SAXON = "net.sf.saxon.TransformerFactoryImpl";

    public static TransformerFactory transformerFactory;

    static {
        System.setProperty("javax.xml.transform.TransformerFactory",
                TRANSFORMER_FACTORY_PROCESSOR_XALAN);
                // TRANSFORMER_FACTORY_SAXON);
        transformerFactory = javax.xml.transform.TransformerFactory.newInstance();
    }

    public static void main(String[] args) throws Exception {

        // Verify using Java 7 or greater
        System.out.println(System.getProperty("java.vendor"));
        System.out.println(System.getProperty("java.version"));

        // High and low surrogate of U+1F48B
        char[] chars = {55357, 56459};
        int codePoint = Character.codePointAt(chars, 0);

        // Verify it's a valid code point
        System.out.println(Character.isValidCodePoint(codePoint));

        // Convert it to a string
        String astral = new String(Character.toChars(codePoint));

        // Show that we can write the string to a file
        FileOutputStream fos = new FileOutputStream(new File(System.getProperty("user.dir") + "/astral.txt"));
        fos.write(astral.getBytes("UTF-8"));
        fos.close(); // it is written as U+1F48B, as expected

        // Now show how it all falls apart with Xalan

        // Create a DOM doc containing astral char
        DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
        documentBuilderFactory.setNamespaceAware(true);
        DocumentBuilder db = documentBuilderFactory.newDocumentBuilder();
        Document doc = db.newDocument();
        Element foo = doc.createElement("foo");
        doc.appendChild(foo);
        foo.setTextContent(astral);

        // Write using Transformer transform
        FileOutputStream fos2 = new FileOutputStream(new File(System.getProperty("user.dir") + "/astral.xml"));
        writeDocument(doc, fos2);
        fos2.close(); // Xalan writes &#55357;&#56459; but Saxon 8.7 is ok
    }

    protected static void writeDocument(Document document, OutputStream outputStream) throws Exception {
        Transformer serializer = transformerFactory.newTransformer();
        // Print which implementation is actually in use
        System.out.println(serializer.getClass().getName());
        serializer.setOutputProperty(javax.xml.transform.OutputKeys.ENCODING, "UTF-8");
        serializer.setOutputProperty(javax.xml.transform.OutputKeys.OMIT_XML_DECLARATION, "yes");
        serializer.setOutputProperty(javax.xml.transform.OutputKeys.METHOD, "xml");
        serializer.transform(new DOMSource(document), new StreamResult(outputStream));
    }
}
This is a known Xalan bug:
https://issues.apache.org/jira/browse/XALANJ-2419
https://issues.apache.org/jira/browse/XALANJ-2560
See also: Serializing supplementary unicode characters into XML documents with Java
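Until the bug is fixed, one possible stopgap is to post-process the serialized text and merge each adjacent high/low surrogate reference pair into one reference to the real code point. The sketch below is only that, a workaround under assumptions and not a Xalan API: `SurrogateReferenceFixer` and `mergeSurrogateReferences` are hypothetical names, it assumes the references are decimal (as in the output shown in the question), and it works on raw text, so it would also rewrite a matching literal inside a CDATA section or comment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SurrogateReferenceFixer {
    // A single decimal character reference such as &#55357; (at most 7 digits).
    private static final Pattern NCR = Pattern.compile("&#(\\d{1,7});");

    /** Merges adjacent high/low surrogate references into one code point reference. */
    static String mergeSurrogateReferences(String xml) {
        Matcher m = NCR.matcher(xml);
        StringBuilder out = new StringBuilder(xml.length());
        int copiedUpTo = 0; // everything before this index is already in 'out'
        int searchFrom = 0;
        while (m.find(searchFrom)) {
            int firstStart = m.start();
            int firstEnd = m.end();
            int high = Integer.parseInt(m.group(1));
            searchFrom = firstEnd;
            if (high < 0xD800 || high > 0xDBFF) {
                continue; // not a high surrogate, leave it alone
            }
            // The low surrogate reference must start exactly where the first one ends.
            if (m.find(firstEnd) && m.start() == firstEnd) {
                int low = Integer.parseInt(m.group(1));
                if (low >= 0xDC00 && low <= 0xDFFF) {
                    int codePoint = Character.toCodePoint((char) high, (char) low);
                    out.append(xml, copiedUpTo, firstStart)
                       .append("&#").append(codePoint).append(';');
                    copiedUpTo = m.end();
                    searchFrom = m.end();
                }
            }
        }
        out.append(xml, copiedUpTo, xml.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(mergeSurrogateReferences("<foo>&#55357;&#56459;</foo>"));
        // prints <foo>&#128139;</foo>
    }
}
```

To use it with the code from the question, the transform would have to write to a `StringWriter` (or a `ByteArrayOutputStream` decoded as UTF-8) so the text can be fixed before it goes to the file. Switching to a serializer that handles supplementary characters, such as Saxon, avoids the issue entirely.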