Splitting strings with regex. Java

Question

I'm trying to fill an ArrayList with words, but sometimes it adds an empty string. Why does that happen, and how can I avoid it?

    ArrayList<String> textAL = new ArrayList<String>();
    String text = "This.IS(a) text example blah? bl:ah";
    String regex = "[\\s\\?\\.,:;\\)\\(]";

    String[] splittedText = text.split(regex);

    for(int i = 0; i < splittedText.length; i++){
        if(splittedText[i] != " "){  //ignore whitespace
            textAL.add(splittedText[i]);
        }           
    }

    for(int i = 0; i < textAL.size(); i++){
        System.out.println("textAL(" + i + ") "+ textAL.get(i));
    }

Result:

textAL(0) This
textAL(1) IS
textAL(2) a
textAL(3) 
textAL(4) text
textAL(5) example
textAL(6) blah
textAL(7) 
textAL(8) bl
textAL(9) 
textAL(10) ah

Answer 1

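I think the problem is that you forgot the + at the end of your regex, e.g.

String regex = "[\\s\\?\\.,:;\\)\\(]+";

But how about something simpler, like

String regex = "\\W+";

Note that \W is equivalent to [^\w].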
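As a quick check of what \W treats as a delimiter: by default \w is [a-zA-Z_0-9], so digits and underscores stay inside a token. A minimal sketch, with a made-up class name, helper name, and sample string:

```java
import java.util.Arrays;

class NonWordDemo {
    // Hypothetical helper: splits on runs of non-word characters.
    static String[] tokens(String text) {
        return text.split("\\W+");
    }

    public static void main(String[] args) {
        String sample = "foo_bar, baz42!";

        // The underscore and the digits are word characters, so they are kept.
        System.out.println(Arrays.toString(tokens(sample))); // [foo_bar, baz42]

        // \W+ and [^\w]+ give the same tokens.
        System.out.println(Arrays.equals(tokens(sample), sample.split("[^\\w]+"))); // true
    }
}
```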

Test:

import java.util.ArrayList;

public static void main(String[] args) {
  ArrayList<String> textAL = new ArrayList<String>();
  String text = "This.IS(a) text example blah? bl:ah";
  // String regex = "[\\s\\?\\.,:;\\)\\(]+";
  String regex = "\\W+";  // one or more non-word characters

  String[] splittedText = text.split(regex);

  // for this input no empty-string check is needed: "+" treats each run of delimiters as one
  for(int i = 0; i < splittedText.length; i++){
      textAL.add(splittedText[i]);
  }

  for(int i = 0; i < textAL.size(); i++){
      System.out.println("t2(" + i + ") "+ textAL.get(i));
  }
}

Result:

t2(0) This
t2(1) IS
t2(2) a
t2(3) text
t2(4) example
t2(5) blah
t2(6) bl
t2(7) ah
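One case the + quantifier does not cover, as far as I know: if the text itself starts with a delimiter, split still returns an empty first element. A minimal sketch of one way to handle that, assuming a hypothetical helper named words that simply skips empty tokens:

```java
import java.util.ArrayList;
import java.util.List;

class LeadingDelimiterDemo {
    // Hypothetical helper: splits on runs of non-word characters and skips empty tokens.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        for (String token : text.split("\\W+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "(a) text";

        // The "(" matched at index 0 leaves an empty string in front of it.
        System.out.println(text.split("\\W+").length); // 3

        // Filtering with isEmpty() drops that leading element.
        System.out.println(words(text)); // [a, text]
    }
}
```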

Edit

Your other problem is here:

splittedText[i] != " "

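You are comparing Strings with the != operator, and you don't want to compare Strings with == or !=. Instead, use the equals(...) or equalsIgnoreCase(...) method. Understand that == and != check whether the two references point to the same object, which is not what you're interested in. The methods, on the other hand, check whether the two Strings have the same characters in the same order, and that's what matters here.

Fortunately, with the correct regex this is a non-issue for your current code, but it may become a problem in future code, so do keep it in mind.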
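To make the difference concrete, here is a small sketch. The class and the helper isSingleSpace are made up for illustration; the point is only that two String objects can hold the same characters and still be different objects:

```java
class StringCompareDemo {
    // Hypothetical helper: true when the token is exactly one space, compared by content.
    static boolean isSingleSpace(String token) {
        return " ".equals(token); // constant first, so a null token cannot throw
    }

    public static void main(String[] args) {
        String literal = " ";
        String built = new String(" "); // a distinct object with the same characters

        System.out.println(literal != built);      // true: the references differ
        System.out.println(literal.equals(built)); // true: the contents match
        System.out.println(isSingleSpace(built));  // true
    }
}
```

Note also that the empty entries in the question's output are empty strings (""), not single spaces, so a content check against " " would not have filtered them either; isEmpty() is the check that matches them.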

Answer 2

You need to add a quantifier to your Pattern:

// requires: import java.util.Arrays;
String text = "This.IS(a) text example blah? bl:ah";
// Edit: now with removed escapes when not necessary - thanks hwnd
//              ┌ original character class
//              |          ┌ greedy quantifier: "one or more times"
//              |          |
String regex = "[\\s?.:;)(]+";
String[] splittedText = text.split(regex);
System.out.println(Arrays.toString(splittedText));

Output

[This, IS, a, text, example, blah, bl, ah]
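For comparison, a short sketch of what the quantifier changes, using a shortened input and a hypothetical helper tokenCount. Without +, each delimiter character is its own match, so two adjacent delimiters leave an empty string between them:

```java
import java.util.Arrays;

class QuantifierDemo {
    // Hypothetical helper: number of elements split returns for the given regex.
    static int tokenCount(String text, String regex) {
        return text.split(regex).length;
    }

    public static void main(String[] args) {
        String text = "a) text";

        // ")" and " " are matched separately, leaving "" between them.
        System.out.println(Arrays.toString(text.split("[\\s?.:;)(]")));  // [a, , text]

        // With +, the run ") " is consumed as a single delimiter.
        System.out.println(Arrays.toString(text.split("[\\s?.:;)(]+"))); // [a, text]
    }
}
```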

Answer 3

What about String regex = "[^\\w]+"; ? Written this way, you can add your own characters that should not count as delimiters, such as the apostrophe: "[^\\w']+"
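A small sketch of the apostrophe case, with a made-up sample sentence and helper name, showing how the two patterns differ:

```java
import java.util.Arrays;

class ApostropheDemo {
    // Hypothetical helper: splits on runs of characters that are neither
    // word characters nor apostrophes.
    static String[] splitKeepingApostrophes(String text) {
        return text.split("[^\\w']+");
    }

    public static void main(String[] args) {
        String text = "don't stop, it's fine";

        // \W+ treats the apostrophe as a delimiter and breaks the contractions.
        System.out.println(Arrays.toString(text.split("\\W+")));
        // [don, t, stop, it, s, fine]

        // Adding ' to the negated class keeps the contractions whole.
        System.out.println(Arrays.toString(splitKeepingApostrophes(text)));
        // [don't, stop, it's, fine]
    }
}
```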

java

regex

split

arraylist