Java RegEx matcher breaks characters outside the BMP
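I am currently writing a util class to sanitize input that is saved to an XML document. For us, sanitizing means that all illegal characters (https://en.wikipedia.org/wiki/Valid_characters_in_XML#XML_1.0) are simply removed from the string.
I tried to do this with a regular expression that replaces every invalid character with an empty string, but for Unicode characters outside the BMP this somehow breaks the encoding and leaves me with those ? characters. It also does not seem to matter which way of regex replacement I use (String#replaceAll(String, String), Pattern#compile(String), org.apache.commons.lang3.RegExUtils#removeAll(String, String)).
Here is an example implementation with a test (in Spock) that shows the problem: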
XmlStringUtil.java
    package com.example.util;

    import lombok.NonNull;

    import java.util.regex.Pattern;

    public class XmlStringUtil {

        // Negated class: matches every character that is NOT valid in XML 1.0.
        // Backslashes are doubled so the regex engine, not the compiler, reads the escapes.
        private static final Pattern XML_10_PATTERN = Pattern.compile(
            "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]"
        );

        public static String sanitizeXml10(@NonNull String text) {
            return XML_10_PATTERN.matcher(text).replaceAll("");
        }
    }
XmlStringUtilSpec.groovy
    package com.example.util

    import spock.lang.Specification

    class XmlStringUtilSpec extends Specification {

        def 'sanitize string values for xml version 1.0'() {
            when: 'a string is sanitized'
            def sanitizedString = XmlStringUtil.sanitizeXml10 inputString

            then: 'the returned sanitized string matches the expected one'
            sanitizedString == expectedSanitizedString

            where:
            inputString                                 | expectedSanitizedString
            ''                                          | ''
            '\b'                                        | ''
            '\u0001'                                    | ''
            'Hello World!\0'                            | 'Hello World!'
            'text with emoji \uD83E\uDDD1\uD83C\uDFFB'  | 'text with emoji \uD83E\uDDD1\uD83C\uDFFB'
        }
    }
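I currently have a workaround in which I rebuild the whole string from its individual code points, but that does not feel like the right solution.
Thanks in advance!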
A solution without regular expressions is to filter the stream of code points:
    public static String sanitize_xml_10(String input) {
        return input.codePoints()
                // Test is the class that contains allowedXml10
                .filter(Test::allowedXml10)
                .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
                .toString();
    }

    // Returns true if the code point is a valid XML 1.0 character.
    private static boolean allowedXml10(int codepoint) {
        if (0x0009 == codepoint) return true;
        if (0x000A == codepoint) return true;
        if (0x000D == codepoint) return true;
        if (0x0020 <= codepoint && codepoint <= 0xD7FF) return true;
        if (0xE000 <= codepoint && codepoint <= 0xFFFD) return true;
        if (0x10000 <= codepoint && codepoint <= 0x10FFFF) return true;
        return false;
    }
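To see why the filter runs over `codePoints()` rather than over individual `char` values, the following sketch may help. It is only an illustration: the class name `CodePointDemo` and the helper `filterChars` are made up for this example, and the predicate is a condensed copy of `allowedXml10` from above. A character outside the BMP is stored as a surrogate pair, and each surrogate on its own falls into the D800-DFFF gap that XML 1.0 excludes, so a per-`char` filter would drop it.

```java
class CodePointDemo {

    // Same ranges as allowedXml10 in the answer above.
    static boolean allowedXml10(int codepoint) {
        return codepoint == 0x9 || codepoint == 0xA || codepoint == 0xD
            || (0x20 <= codepoint && codepoint <= 0xD7FF)
            || (0xE000 <= codepoint && codepoint <= 0xFFFD)
            || (0x10000 <= codepoint && codepoint <= 0x10FFFF);
    }

    // Filters whole code points, so surrogate pairs stay together.
    static String filterCodePoints(String input) {
        return input.codePoints()
            .filter(CodePointDemo::allowedXml10)
            .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
            .toString();
    }

    // Hypothetical counter-example: filters UTF-16 code units one by one.
    static String filterChars(String input) {
        return input.chars()
            .filter(CodePointDemo::allowedXml10)
            .collect(StringBuilder::new, (sb, c) -> sb.append((char) c), StringBuilder::append)
            .toString();
    }

    public static void main(String[] args) {
        String emoji = "\uD83E\uDDD1"; // U+1F9D1: one code point, two chars
        String input = "a\u0001" + emoji;

        System.out.println("chars: " + emoji.length());
        System.out.println("code points: " + emoji.codePointCount(0, emoji.length()));
        System.out.println("code point filter keeps emoji: "
            + filterCodePoints(input).equals("a" + emoji));
        System.out.println("char filter drops emoji: "
            + filterChars(input).equals("a"));
    }
}
```

Under these assumptions the code point filter removes only the control character, while the per-`char` variant also removes both halves of the emoji.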
After some reading and experimenting, a slight change to the regex (replacing \x{..} with surrogate pairs \u...\u...) works:
    // Supplementary range U+10000..U+10FFFF written as surrogate pairs.
    private static final Pattern XML_10_PATTERN = Pattern.compile(
        "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\uD800\\uDC00-\\uDBFF\\uDFFF]"
    );
Check:
    sanitizeXml10("\uD83E\uDDD1\uD83C\uDFFB").codePoints().mapToObj(Integer::toHexString).forEach(System.out::println);
Result:
    1f9d1
    1f3fb
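For anyone who wants to try this without Spock, here is a minimal, self-contained sketch that runs the surrogate-pair pattern against the same rows as the test table in the question. The class name `RegexSanitizeDemo` is made up for this example, and the pattern is written with doubled backslashes on the assumption that the regex engine, not the compiler, should interpret the escapes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class RegexSanitizeDemo {

    // Pattern from the answer above, supplementary range given as surrogate pairs.
    static final Pattern XML_10_PATTERN = Pattern.compile(
        "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\uD800\\uDC00-\\uDBFF\\uDFFF]");

    static String sanitizeXml10(String text) {
        return XML_10_PATTERN.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Input -> expected output, mirroring the rows of the Spock table.
        Map<String, String> cases = new LinkedHashMap<>();
        cases.put("", "");
        cases.put("\b", "");
        cases.put("\u0001", "");
        cases.put("Hello World!\0", "Hello World!");
        cases.put("text with emoji \uD83E\uDDD1\uD83C\uDFFB",
                  "text with emoji \uD83E\uDDD1\uD83C\uDFFB");

        // Prints one line per row: "ok" when the sanitized string equals the expectation.
        cases.forEach((input, expected) ->
            System.out.println(sanitizeXml10(input).equals(expected) ? "ok" : "MISMATCH"));
    }
}
```

If a row prints MISMATCH, it is worth checking the source file encoding and how the string reaches the method before suspecting the pattern itself.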