Removing Unicode characters from a String in Java using regex
My input string looks like this:
String comment = "Good morning! \u2028\u2028I am looking to purchase a new Honda car as I\u2019m outgrowing my current car. I currently drive a Hyundai Accent and I was looking for something a little bit larger and more comfortable like the Honda Civic. May I know if you have any of the models currently in stock? Thank you! Warm regards Sandra";
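I want to remove Unicode characters such as "\u2028" and "\u2019" that appear in the comment section. At runtime I don't know which extra characters will be present. So what is the best way to handle this?
I tried to remove the Unicode characters from the string like this: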
Comments.replaceAll("\\P{Print}", "");
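What is the best way to match the Unicode characters present in the comment? If any are present, they should be removed; otherwise the comment should be passed to the target system unchanged.
Can anyone help me with this?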
You can do this in sequence, as shown below:
public static void main(final String[] args) {
    String comment = "Good morning! \u2028\u2028I am looking to purchase a new Honda car as I\u2019m outgrowing my current car. I currently drive a Hyundai Accent and I was looking for something a little bit larger and more comfortable like the Honda Civic. May I know if you have any of the models currently in stock? Thank you! Warm regards Sandra";
    // remove all non-ASCII characters
    comment = comment.replaceAll("[^\\x00-\\x7F]", "");
    // remove all the ASCII control characters except CR, LF and TAB
    comment = comment.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");
    // remove characters in Unicode category C (control, format, unassigned, ...);
    // note that this category also covers CR, LF and TAB
    comment = comment.replaceAll("\\p{C}", "");
    System.out.println(comment);
}
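The steps above can also be packaged as a small reusable helper. This is only a minimal sketch: the class name `CommentCleaner` and the method `clean` are made up for illustration, and the patterns are compiled once rather than on every call. It leaves out the `\p{C}` step on purpose, because that category includes tab, carriage return and line feed, so running it would undo the exception made for those three characters in the second step.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the names are not from the thread above.
final class CommentCleaner {
    // Anything outside the 7-bit ASCII range (U+0000 to U+007F).
    private static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7F]");
    // ASCII control characters, except CR, LF and TAB, which are kept.
    private static final Pattern ASCII_CONTROL =
            Pattern.compile("[\\p{Cntrl}&&[^\r\n\t]]");

    // Returns a copy of the input with non-ASCII and unwanted control characters removed.
    static String clean(String input) {
        String result = NON_ASCII.matcher(input).replaceAll("");
        return ASCII_CONTROL.matcher(result).replaceAll("");
    }

    public static void main(String[] args) {
        String comment = "Good morning! \u2028\u2028I\u2019m outgrowing my current car.";
        // Strings are immutable, so the cleaned value is the return value of clean().
        System.out.println(clean(comment));
    }
}
```

Because `String` is immutable, the result has to be assigned or passed on; calling a replace method without using its return value leaves the original comment unchanged.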
If you use replace, some characters are lost; for example, I'm becomes Im. So the better approach is to convert instead.
You can convert the Unicode to UTF-8:
// requires: import java.nio.charset.StandardCharsets;
byte[] byteComment = comment.getBytes(StandardCharsets.UTF_8);
String formattedComment = new String(byteComment, StandardCharsets.UTF_8);
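One caveat worth keeping in mind: encoding a well-formed Java string to UTF-8 and decoding it again yields a string equal to the original, so the round trip by itself does not take out U+2028 or U+2019. If the goal is to keep "I'm" readable in an ASCII-only target system (an assumption about the requirement, not something the question states), one option is to map common typographic punctuation to ASCII equivalents before stripping whatever remains. The sketch below uses a hypothetical class `PunctuationFolder` and a small hand-picked mapping; a real system would likely need a longer list.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the mapping below is a hand-picked assumption.
final class PunctuationFolder {
    private static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7F]");

    static String toAscii(String input) {
        String folded = input
                .replace('\u2018', '\'')  // left single quotation mark
                .replace('\u2019', '\'')  // right single quotation mark (typographic apostrophe)
                .replace('\u201C', '"')   // left double quotation mark
                .replace('\u201D', '"')   // right double quotation mark
                .replace('\u2028', ' ')   // line separator
                .replace('\u2029', ' ');  // paragraph separator
        // Anything still outside ASCII has no mapping here and is dropped.
        return NON_ASCII.matcher(folded).replaceAll("");
    }

    public static void main(String[] args) {
        String comment = "Good morning! \u2028\u2028I\u2019m outgrowing my current car.";
        System.out.println(toAscii(comment));
    }
}
```

With this approach the apostrophe in the sample comment survives as a plain ASCII `'`, and each line separator becomes a space instead of disappearing.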