Find multiple regex matches in Java prohibiting non-matches

I have a Java Pattern such as \s+(foo|bar) that finds every occurrence of foo or bar preceded by whitespace. Using the match group I can extract the text that actually matched.

Pattern pattern = Pattern.compile("\\s+(foo|bar)");
Matcher matcher = pattern.matcher(someText);
while(matcher.find()) {
  String value = matcher.group(1);
  ...
}

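This works for strings like ' foo foo bar' (note the leading space), but it also matches strings like ' foo foo bad'. How can I either stop the matcher from matching past a run of characters that does not match, or detect that characters were skipped or that no more characters remain? In other words, I want the entire string being matched to be a sequence of consecutive substrings that each match the pattern. How can I guarantee that?

The point here is to keep walking through the string finding matches. I could easily split the string and then perform additional comparisons, but I don't want the overhead of multiple regex passes, array/list creation, and so on.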

Prefix the regex with \G. The Javadoc for Pattern says:

\G - The end of the previous match

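Of course, on the first match, "the end of the previous match" is the start of the input.

This ensures that the regex matches are all consecutive, starting at the beginning of the input. It does not mean the matches will reach the end of the input; you have to check that yourself.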

Example

public static void main(String[] args) {
    test("abc");
    test(" foo foo bar");
    test(" foo foo bad");
    test(" foo bad foo");
}
static void test(String input) {
    System.out.println("'" + input + "'");
    int lastEnd = 0;
    // Backslashes must be doubled inside a Java string literal.
    Matcher m = Pattern.compile("\\G\\s+(foo|bar)").matcher(input);
    while (m.find()) {
        System.out.printf("  g0='%s' (%d-%d), g1='%s' (%d-%d)%n",
                          m.group(), m.start(), m.end(),
                          m.group(1), m.start(1), m.end(1));
        lastEnd = m.end();
    }
    if (lastEnd == input.length())
        System.out.println("  OK");
    else
        System.out.println("  Incomplete: Last match ended at " + lastEnd);
}

Output

'abc'
  Incomplete: Last match ended at 0
' foo foo bar'
  g0=' foo' (0-4), g1='foo' (1-4)
  g0=' foo' (4-8), g1='foo' (5-8)
  g0=' bar' (8-12), g1='bar' (9-12)
  OK
' foo foo bad'
  g0=' foo' (0-4), g1='foo' (1-4)
  g0=' foo' (4-8), g1='foo' (5-8)
  Incomplete: Last match ended at 8
' foo bad foo'
  g0=' foo' (0-4), g1='foo' (1-4)
  Incomplete: Last match ended at 4

For comparison, without \G in the regex, the output of the same code would be:

'abc'
  Incomplete: Last match ended at 0
' foo foo bar'
  g0=' foo' (0-4), g1='foo' (1-4)
  g0=' foo' (4-8), g1='foo' (5-8)
  g0=' bar' (8-12), g1='bar' (9-12)
  OK
' foo foo bad'
  g0=' foo' (0-4), g1='foo' (1-4)
  g0=' foo' (4-8), g1='foo' (5-8)
  Incomplete: Last match ended at 8
' foo bad foo'
  g0=' foo' (0-4), g1='foo' (1-4)
  g0=' foo' (8-12), g1='foo' (9-12)
  OK

As you can see, without \G the last example fails to detect that the text ' bad' was skipped.
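The two checks described above (contiguity via \G, plus the separate end-of-input test) can be packaged into one helper. This is only a minimal sketch: the class name ContiguousMatches, the method parseAll, and the constant TOKEN are illustrative names that do not appear in the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContiguousMatches {
    // \G anchors each match at the end of the previous one (or at the start of the input).
    // TOKEN and parseAll are hypothetical names used for illustration only.
    private static final Pattern TOKEN = Pattern.compile("\\G\\s+(foo|bar)");

    /** Returns the captured tokens, or an empty Optional if part of the input was not matched. */
    static Optional<List<String>> parseAll(String input) {
        List<String> tokens = new ArrayList<>();
        int lastEnd = 0;
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group(1));
            lastEnd = m.end();
        }
        // \G only guarantees contiguity; reaching the end of the input is checked separately.
        return lastEnd == input.length() ? Optional.of(tokens) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(parseAll(" foo foo bar")); // Optional[[foo, foo, bar]]
        System.out.println(parseAll(" foo bad foo")); // Optional.empty
    }
}
```

Note that under this sketch an empty input yields an empty list rather than a failure, because zero matches ending at offset 0 still cover the whole input; add an explicit check if that case should be rejected.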

A solution that requires one extra matching pass is to first try to match the whole input against the following regex:

^(\s+(foo|bar))+$

If that succeeds, you can then call find() repeatedly to pull out the individual matches:

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Test
{
    public static void main(String[] args) {
        String[] tests =  {
            " foo foo bar",
            " foo foo x foo bar"
        };
        // matches() already requires the whole input to match, so ^ and $ are not needed here.
        Pattern pattern1 = Pattern.compile("(\\s+(foo|bar))+");
        Pattern pattern2 = Pattern.compile("\\s+(foo|bar)");
        for (int i = 0; i < tests.length; i++) {
            String test = tests[i];
            Matcher m1 = pattern1.matcher(test);
            if (m1.matches()) {
                System.out.println("Matches against: '" + test + "'");
                Matcher m2 = pattern2.matcher(test);
                while (m2.find()) {
                    System.out.println("\t'" + m2.group() + "'");
                }
            }
        }
    }
}

Prints:

Matches against: ' foo foo bar'
        ' foo'
        ' foo'
        ' bar'

If the entire input does not have to match, use a regex that finds the matching prefix of the string instead:

^(\s+(foo|bar))+

You can compare the length of this match with the length of the input to determine whether the whole string matched.

Then:

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Test
{
    public static void main(String[] args) {
        String[] tests =  {
            " foo foo bar",
            " foo foo x foo bar"
        };
        Pattern pattern1 = Pattern.compile("^(\\s+(foo|bar))+");
        Pattern pattern2 = Pattern.compile("\\s+(foo|bar)");
        for (int i = 0; i < tests.length; i++) {
            String test = tests[i];
            Matcher m1 = pattern1.matcher(test);
            if (m1.find()) {
                String s = m1.group();
                System.out.println("Matches against: '" + s + "'");
                Matcher m2 = pattern2.matcher(s);
                while (m2.find()) {
                    System.out.println("\t'" + m2.group() + "'");
                }
            }
        }
    }
}

Prints:

Matches against: ' foo foo bar'
        ' foo'
        ' foo'
        ' bar'
Matches against: ' foo foo'
        ' foo'
        ' foo'
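
The length comparison mentioned above is not shown in that program. A minimal sketch of it might look like the following; the class name PrefixCheck, the method matchedPrefixLength, and the constant PREFIX are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrefixCheck {
    // Same prefix regex as in the answer; PREFIX is a hypothetical name.
    private static final Pattern PREFIX = Pattern.compile("^(\\s+(foo|bar))+");

    /** Length of the leading run of tokens, or -1 when the input does not start with one. */
    static int matchedPrefixLength(String input) {
        Matcher m = PREFIX.matcher(input);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        for (String input : new String[] {" foo foo bar", " foo foo x foo bar", "abc"}) {
            int end = matchedPrefixLength(input);
            if (end == input.length()) {
                System.out.println("'" + input + "': whole input matched");
            } else if (end >= 0) {
                System.out.println("'" + input + "': matched only the first " + end + " characters");
            } else {
                System.out.println("'" + input + "': no match");
            }
        }
    }
}
```

Because the pattern starts with ^, the match always begins at offset 0, so m.end() is the length of the matched prefix.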