Removing Unicode characters from a String in Java using regex

Question

My input string looks like this:

String comment = "Good morning! \u2028\u2028I am looking to purchase a new Honda car as I\u2019m outgrowing my current car. I currently drive a Hyundai Accent and I was looking for something a little bit larger and more comfortable like the Honda Civic. May I know if you have any of the models currently in stock? Thank you! Warm regards Sandra";

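I want to remove Unicode characters such as "\u2028" and "\u2019" that appear in the comment section. At runtime I do not know in advance which extra characters will be present. What is the best way to handle this?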

I tried to remove the Unicode characters from the given string as follows:

comment = comment.replaceAll("\\P{Print}", "");

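So what is the best way to match the Unicode characters present in the comment? If any are present they should be removed; otherwise the comment should be passed to the target system unchanged.

Can anyone help me with this?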

Answer 1

You can do this in sequence, as follows:

public static void main(final String[] args) {
    String comment = "Good morning! \u2028\u2028I am looking to purchase a new Honda car as I\u2019m outgrowing my current car. I currently drive a Hyundai Accent and I was looking for something a little bit larger and more comfortable like the Honda Civic. May I know if you have any of the models currently in stock? Thank you! Warm regards Sandra";

    // remove all non-ASCII characters (the backslash must be doubled inside a Java string literal)
    comment = comment.replaceAll("[^\\x00-\\x7F]", "");

    // remove all ASCII control characters except CR, LF and TAB
    comment = comment.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");

    // remove characters in the Unicode "Other" category (control, format, and similar)
    comment = comment.replaceAll("\\p{C}", "");
    System.out.println(comment);
}
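To see what each of the three passes above removes on its own, here is a minimal sketch that splits them into separate helpers. The class name `UnicodeStripper` and the method names are hypothetical and chosen only for illustration; the regular expressions are the ones from the answer.

```java
public class UnicodeStripper {

    // Hypothetical helper: drops every character outside the 7-bit ASCII range.
    static String stripNonAscii(String s) {
        return s.replaceAll("[^\\x00-\\x7F]", "");
    }

    // Hypothetical helper: drops ASCII control characters but keeps CR, LF and TAB.
    static String stripAsciiControls(String s) {
        return s.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");
    }

    // Hypothetical helper: drops characters in the Unicode "Other" category
    // (control and format characters, among others).
    static String stripUnicodeOther(String s) {
        return s.replaceAll("\\p{C}", "");
    }

    public static void main(String[] args) {
        // Sample with a line separator (U+2028), a curly apostrophe (U+2019),
        // a TAB and a BEL control character.
        String sample = "Hi!\u2028I\u2019m\there\u0007";

        System.out.println(stripNonAscii(sample).length());
        System.out.println(stripAsciiControls(sample).length());
        System.out.println(stripUnicodeOther(sample).length());
    }
}
```

Note that U+2028 belongs to the line-separator category rather than the "Other" category, so in this sketch only the non-ASCII pass removes it; the `\\p{C}` pass alone leaves it in place and also removes TAB, CR and LF.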

Answer 2

If you use replace, some characters are lost; for example, I'm becomes Im. So the better approach is to convert.

You can convert the Unicode to UTF-8.

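import java.nio.charset.StandardCharsets;

// StandardCharsets.UTF_8 avoids the checked UnsupportedEncodingException of the String-named overloads
byte[] byteComment = comment.getBytes(StandardCharsets.UTF_8);

String formattedComment = new String(byteComment, StandardCharsets.UTF_8);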
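The concern about I'm turning into Im can also be addressed by mapping common typographic characters to ASCII equivalents before stripping whatever is left. The following is only a sketch under assumptions: the class name `CommentSanitizer`, the method `sanitize`, and the particular mapping (curly quotes to straight quotes, line and paragraph separators to a space) are illustrative choices, not something taken from either answer.

```java
public class CommentSanitizer {

    // Hypothetical helper: replaces a few typographic characters with ASCII
    // look-alikes, then removes any remaining non-ASCII characters.
    static String sanitize(String s) {
        return s
            // curly single quotes (U+2018, U+2019) -> ASCII apostrophe
            .replaceAll("[\u2018\u2019]", "'")
            // curly double quotes (U+201C, U+201D) -> ASCII double quote
            .replaceAll("[\u201C\u201D]", "\"")
            // line and paragraph separators (U+2028, U+2029) -> single space
            .replaceAll("[\u2028\u2029]", " ")
            // anything still outside 7-bit ASCII is dropped
            .replaceAll("[^\\x00-\\x7F]", "");
    }

    public static void main(String[] args) {
        String comment = "Good morning! \u2028\u2028I\u2019m outgrowing my current car.";
        System.out.println(sanitize(comment));
    }
}
```

Each separator becomes its own space, so runs of separators produce runs of spaces; collapse them afterwards if the target system requires it.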

java

regex

unicode

non-ascii-characters