ReplaceAll method in String API
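I have a situation where I need to strip certain characters from a string (non-ASCII characters, control characters, and other non-printable characters), as shown below: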
// Backslashes that belong to the regex must be doubled inside a Java string literal.
private static final String NON_ASCII_CHARACTERS = "[^\\x00-\\x7F]";
private static final String ASCII_CONTROL_CHARACTERS = "[\\p{Cntrl}&&[^\r\n\t]]";
private static final String NON_PRINTABLE_CHARACTERS = "\\p{C}";
stringValue.replaceAll(NON_ASCII_CHARACTERS, "").replaceAll(ASCII_CONTROL_CHARACTERS, "")
.replaceAll(NON_PRINTABLE_CHARACTERS, "");
Can the code above be refactored so that a single replaceAll call covers all of these conditions?
If there is a way to do this, please advise.
You can use the regex OR operator, |, to join the three patterns:
// Backslashes that belong to the regex must be doubled inside a Java string literal.
private static final String NON_ASCII_CHARACTERS = "[^\\x00-\\x7F]";
private static final String ASCII_CONTROL_CHARACTERS = "[\\p{Cntrl}&&[^\r\n\t]]";
private static final String NON_PRINTABLE_CHARACTERS = "\\p{C}";
public static String process(String stringValue) {
    // One pass: a character is removed if it matches any of the three alternatives.
    return stringValue.replaceAll(NON_ASCII_CHARACTERS + "|" + ASCII_CONTROL_CHARACTERS + "|" + NON_PRINTABLE_CHARACTERS, "");
}
public static void main(String[] args) {
String val = process("A9339a0zzz]3");
System.out.println(val);
}
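To make the effect of the alternation easier to see, here is a small self-contained sketch. The class name AlternationDemo, the method name clean, and the sample input are made up for illustration; the three patterns are the ones from the question. It also precompiles the pattern once, which is a design choice of this sketch rather than something the answer above requires.

```java
import java.util.regex.Pattern;

class AlternationDemo {
    // The three patterns from the question, joined with the regex OR operator
    // and compiled once so the regex is not re-parsed on every call.
    private static final Pattern UNWANTED = Pattern.compile(
            "[^\\x00-\\x7F]" + "|" + "[\\p{Cntrl}&&[^\r\n\t]]" + "|" + "\\p{C}");

    static String clean(String s) {
        return UNWANTED.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // U+00E9 is non-ASCII, U+0007 (BEL) is an ASCII control character.
        System.out.println(clean("A\u00E9B\u0007C")); // prints ABC
        // The tab is excluded from the second pattern but still matches \p{C}.
        System.out.println(clean("a\tb"));            // prints ab
    }
}
```

One thing this makes visible: \p{C} includes the Unicode control category, so tab, carriage return, and line feed are removed by the third alternative even though the second pattern deliberately excludes them. If those three characters should survive, the third pattern would need the same exclusion.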
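Code points
You might consider a route other than regex. Each character can be handled as a code point integer, and the Character class can be queried for that character's category.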
String input = … ;
String output =
    input
    .codePoints() // Returns an `IntStream` of code point `int` values.
    .filter( codePoint -> ! Character.isISOControl( codePoint ) ) // Keep only code points that are not ISO control characters. Code points failing the `Predicate` test are omitted.
    .filter( codePoint -> codePoint < 127 ) // Within US-ASCII range. Code point 127 is US-ASCII but is DEL, so we filter that out here.
    .collect( StringBuilder :: new , StringBuilder :: appendCodePoint , StringBuilder :: append ) // Convert the `int` code point integers back into characters.
    .toString() ; // Make a `String` from the contents of the `StringBuilder`.
The Character class has many of the classifications defined by the Unicode Consortium. You can use them to narrow the stream of code points down to those representing the characters you want.
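As a runnable version of the stream pipeline above, here is a minimal sketch. The class name CodePointDemo, the method name keepPrintableAscii, and the sample inputs are illustrative assumptions, not part of the original answer.

```java
class CodePointDemo {
    // Keeps only code points that are not ISO control characters and are below 127,
    // which in practice leaves the printable US-ASCII range U+0020..U+007E.
    static String keepPrintableAscii(String input) {
        return input
                .codePoints()
                .filter(cp -> !Character.isISOControl(cp))
                .filter(cp -> cp < 127)
                .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        // U+00E9 is non-ASCII; U+0007 (BEL) and the tab are ISO control characters.
        System.out.println(keepPrintableAscii("A\u00E9B\u0007C\tD")); // prints ABCD
        // A surrogate pair (U+1F600) is a single code point and is dropped as a unit.
        System.out.println(keepPrintableAscii("x\uD83D\uDE00y"));     // prints xy
    }
}
```

Because the stream works on code points rather than char values, a character outside the Basic Multilingual Plane is tested once as a whole instead of as two separate surrogates.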
According to the Pattern javadocs, it should also be possible to combine the three character class patterns into a single character class:
private static final String NON_ASCII_CHARACTERS = "[^\\x00-\\x7F]";
private static final String ASCII_CONTROL_CHARACTERS = "[\\p{Cntrl}&&[^\r\n\t]]";
private static final String NON_PRINTABLE_CHARACTERS = "\\p{C}";
becomes
private static final String COMBINED =
    "[[^\\x00-\\x7F][\\p{Cntrl}&&[^\r\n\t]]\\p{C}]";
or
private static final String COMBINED =
"[" + NON_ASCII_CHARACTERS + ASCII_CONTROL_CHARACTERS
+ NON_PRINTABLE_CHARACTERS + "]";
Note that && (intersection) has lower precedence than the implicit union operator, so all of the [ and ] meta-characters above are required.
Decide for yourself which version you find clearer. It is a matter of opinion.
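To check that the combined character class behaves like the original chain of three calls, here is a small comparison sketch. The class name CombinedClassDemo, the helper names chained and combined, and the sample strings are hypothetical; the patterns are taken from the question, written with doubled backslashes so they compile as Java string literals.

```java
class CombinedClassDemo {
    static final String NON_ASCII = "[^\\x00-\\x7F]";
    static final String ASCII_CONTROL = "[\\p{Cntrl}&&[^\r\n\t]]";
    static final String NON_PRINTABLE = "\\p{C}";

    // Union of the three classes: nesting them inside one outer [...] keeps the
    // && inside the second class from applying to its neighbours.
    static final String COMBINED = "[" + NON_ASCII + ASCII_CONTROL + NON_PRINTABLE + "]";

    // The original three-pass version from the question.
    static String chained(String s) {
        return s.replaceAll(NON_ASCII, "")
                .replaceAll(ASCII_CONTROL, "")
                .replaceAll(NON_PRINTABLE, "");
    }

    // The single-pass version using the combined character class.
    static String combined(String s) {
        return s.replaceAll(COMBINED, "");
    }

    public static void main(String[] args) {
        String[] samples = { "A\u00E9B\u0007C\tD", "plain text 123", "x\uD83D\uDE00y" };
        for (String sample : samples) {
            // Prints true when both versions produce the same result for the sample.
            System.out.println(chained(sample).equals(combined(sample)));
        }
    }
}
```

A handful of samples is not a proof of equivalence, so for production use it would be worth running the same comparison over inputs that are representative of the real data.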