Clean a string of the 'unit separator' (0x1f) character for XML
I ran into the following exception when parsing XML generated from input:
org.xml.sax.SAXParseException: Zeichenreferenz "&#
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
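I traced the problem to an input string containing the character 0x1f, an invisible 'UNIT SEPARATOR' control character: http://www.columbia.edu/kermit/ascii.html
I had to copy the input into a text file to make the character visible.
I then tested the input string elsewhere and ran into the following problem: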
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: XML parsing: line 1, character 149, illegal xml character
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:262)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1632)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:602)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:524)
at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7418)
What is the best way to strip such characters from an input string, and are there other problematic characters that should be removed for XML?
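For locating such characters without pasting the input into a text editor, a small diagnostic along these lines may help. It is only a sketch: the class and method names (`XmlCharFinder`, `isValidXml10`, `findInvalid`) are made up for illustration, and the ranges are taken from the `Char` production of the XML 1.0 specification.

```java
import java.util.ArrayList;
import java.util.List;

public class XmlCharFinder {

    /** True if the code point matches the Char production of XML 1.0. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Lists every invalid code point together with its char index. */
    static List<String> findInvalid(String input) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (!isValidXml10(cp)) {
                hits.add(String.format("%d: U+%04X", i, cp));
            }
            // advance by one or two chars, depending on the code point
            i += Character.charCount(cp);
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "<xml>" + (char) 0x1f + "value</xml>";
        System.out.println(findInvalid(input)); // prints [5: U+001F]
    }
}
```

Besides 0x1f, the same check flags the other C0 control characters (everything below 0x20 except tab, line feed and carriage return), unpaired surrogates, and U+FFFE/U+FFFF.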
Here is the solution I ended up with:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StringUtil {

    /** RegEx pattern of invalid xml 1.0 characters, ref : http://www.w3.org/TR/REC-xml/#charsets */
    // The backslashes are doubled so that the regex engine, not the Java compiler, interprets the escapes.
    private static final Pattern INVALID_XML_CHAR_PATTERN = Pattern
            .compile("[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]"); //$NON-NLS-1$

    /**
     * sanitize the passed value for xml 1.0
     *
     * @param input input value to sanitize
     * @return the sanitized value, or null if input was null, empty or not changed
     */
    public static String sanitizeXmlChars(String input) {
        if (input == null || input.isEmpty()) {
            return null;
        }
        Matcher matcher = INVALID_XML_CHAR_PATTERN.matcher(input);
        if (matcher.find()) {
            // replaceAll resets the matcher first, so the match found above is replaced too
            return matcher.replaceAll(""); //$NON-NLS-1$
        }
        return null;
    }
}
Inspired by: https://www.rgagnon.com/javadetails/java-sanitize-xml-string.html
With a simple JUnit test:
// JUnit 4 imports are assumed here; the original does not say which JUnit version is used
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNull;

import org.junit.Test;

public class StringUtilTest {

    @Test
    public void sanitizeXmlChars() {
        String goodXml = "<xml>value'<sub><![CDATA[Inhaltää]]></sub></xml>"; //$NON-NLS-1$
        assertNull(StringUtil.sanitizeXmlChars(goodXml));

        // contains control character after <xml>
        String badXml = "<xml>" + (char) 31 + "value'<sub><![CDATA[Inhaltää]]></sub></xml>"; //$NON-NLS-1$
        String result = StringUtil.sanitizeXmlChars(badXml);
        assertEquals(goodXml, result);

        String goodText = "This is a Text.\nWith two lines."; //$NON-NLS-1$
        // unchanged input yields null (the line feed is a valid XML character)
        assertNull(StringUtil.sanitizeXmlChars(goodText));

        // contains control character after two
        badXml = "This is a Text.\nWith two " + (char) 31 + "lines."; //$NON-NLS-1$
        result = StringUtil.sanitizeXmlChars(badXml);
        assertEquals(goodText, result);

        goodText = "Text Text2"; //$NON-NLS-1$
        assertNull(StringUtil.sanitizeXmlChars(goodText));

        badXml = "Text "; //$NON-NLS-1$
        // append control characters e.g. 30=>Record Separator 31=>Unit Separator
        for (int i = 1; i <= 31; i++) {
            // skip valid control characters: Horizontal Tab, Line Feed, Carriage Return
            if (i == 9 || i == 10 || i == 13) {
                continue;
            }
            badXml += String.valueOf((char) i);
        }
        badXml += "Text2"; //$NON-NLS-1$
        result = StringUtil.sanitizeXmlChars(badXml);
        assertEquals(goodText, result);
    }
}
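If the "null means unchanged" contract is inconvenient for callers, a regex-free variant that always returns a string could look roughly like this. This is a sketch of a different technique (filtering code points), not the code above; `XmlChars` and `stripInvalidXmlChars` are hypothetical names.

```java
public class XmlChars {

    /** True if the code point matches the Char production of XML 1.0. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Hypothetical helper: returns the input with all invalid code points dropped. */
    public static String stripInvalidXmlChars(String input) {
        if (input == null || input.isEmpty()) {
            return input;
        }
        StringBuilder out = new StringBuilder(input.length());
        // codePoints() yields surrogate pairs as one value and lone surrogates as themselves
        input.codePoints()
                .filter(XmlChars::isValidXml10)
                .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        String bad = "Text " + (char) 31 + "Text2";
        System.out.println(stripInvalidXmlChars(bad)); // prints Text Text2
    }
}
```

The trade-off is that the caller can no longer tell from the return value alone whether anything was removed; comparing lengths before and after would be one way to recover that.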
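An alternative solution is to use a third-party library such as Apache Commons Lang:
String cleanInput = StringEscapeUtils.escapeXml10(input);
Note that escapeXml10 also escapes the markup characters <, >, &, " and ', so it is meant for text content rather than for a complete XML document.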
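Whichever route is taken, the character has to be removed rather than written as a numeric character reference. The first stack trace mentions a "Zeichenreferenz" (German for "character reference"), which suggests the generated XML contained such a reference, and XML 1.0 requires referenced characters to be legal as well. A minimal check with the JDK's built-in DOM parser, where `CharRefCheck` and `parses` are names invented for this sketch:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class CharRefCheck {

    /** Returns true if the default DOM parser accepts the document. */
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // not well-formed, e.g. an illegal character or character reference
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<a>&#9;</a>"));  // prints true (tab is allowed)
        System.out.println(parses("<a>&#31;</a>")); // prints false (0x1f is not)
    }
}
```

The default error handler may additionally print a "[Fatal Error]" line to stderr for the rejected document.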