Java matcher groups

Question

I am building a simple Twitter user-mention finder using a regular expression.

public static Set<String> getMentionedUsers(List<Tweet> tweets) {
    Set<String> mentionedUsers = new TreeSet<>();
    String regex = "(?<=^|(?<=[^a-zA-Z0-9-_\\.]))@([A-Za-z][A-Za-z0-9_]+)";

    for(Tweet tweet : tweets){
        Matcher matcher = Pattern.compile(regex).matcher(tweet.getText().toLowerCase());
        if(matcher.find()) {
            mentionedUsers.add(matcher.group(0));
        }
    }
    return mentionedUsers;
}

If a mention is at the end of the text, it is not found. For example, for "@glover tell me about @GREG" the method returns only "@glover".

Answer 1

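You have to keep calling matcher.find() in a loop on each tweet until it finds no more matches. At the moment you check each tweet only once, so only the first mention in a tweet is returned.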

(Side note: you should compile the pattern outside the for loop, ideally outside the method.)

public static Set<String> getMentionedUsers(List<Tweet> tweets) {
    Set<String> mentionedUsers = new TreeSet<>();
    String regex = "(?<=^|(?<=[^a-zA-Z0-9-_\\.]))@([A-Za-z][A-Za-z0-9_]+)";

    Pattern p = Pattern.compile(regex);
    for(Tweet tweet : tweets){
        Matcher matcher = p.matcher(tweet.getText().toLowerCase());
        while (matcher.find()) {
            mentionedUsers.add(matcher.group(0));
        }
    }
    return mentionedUsers;
}
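For anyone who wants to run this, here is a minimal self-contained sketch of the same loop. The Tweet class is not shown in the question, so this version assumes plain strings as input; the class name MentionFinder and the constant MENTION are made up for illustration. The pattern is hoisted into a static constant, as the side note suggests.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MentionFinder {
    // Compiled once, outside the method (same regex as in the question).
    private static final Pattern MENTION =
            Pattern.compile("(?<=^|(?<=[^a-zA-Z0-9-_\\.]))@([A-Za-z][A-Za-z0-9_]+)");

    // Assumption: takes raw tweet texts instead of the question's List<Tweet>.
    public static Set<String> getMentionedUsers(List<String> texts) {
        Set<String> mentionedUsers = new TreeSet<>();
        for (String text : texts) {
            Matcher matcher = MENTION.matcher(text.toLowerCase());
            // Each find() call resumes after the previous match,
            // so the loop visits every mention in the text.
            while (matcher.find()) {
                mentionedUsers.add(matcher.group(0));
            }
        }
        return mentionedUsers;
    }

    public static void main(String[] args) {
        // Prints [@glover, @greg]
        System.out.println(getMentionedUsers(List.of("@glover tell me about @GREG")));
    }
}
```

The result is lowercased because the text is lowercased before matching, as in the question.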

Answer 2

You are adding matcher.group(0) to your Set. From the Java docs:

Group zero denotes the entire pattern, so the expression m.group(0) is equivalent to m.group().

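Capturing groups are numbered starting from 1, so matcher.group(1) gives you the username without the leading @. See the reference: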

Group number

Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:

1 ((A)(B(C)))

2 (A)

3 (B(C))

4 (C)

Group zero always stands for the entire expression.

Capturing groups are so named because, during a match, each subsequence of the input sequence that matches such a group is saved. The captured subsequence may be used later in the expression, via a back reference, and may also be retrieved from the matcher once the match operation is complete.
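A small sketch of the difference, in case it helps. The class GroupDemo and the helper firstGroup are hypothetical names used only for this illustration, and the mention pattern is simplified to drop the lookbehind from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Returns group n of the first match of regex in input, or null if nothing matches.
    public static String firstGroup(String regex, String input, int n) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(n) : null;
    }

    public static void main(String[] args) {
        String mention = "@([A-Za-z][A-Za-z0-9_]+)";
        System.out.println(firstGroup(mention, "hi @glover", 0)); // @glover (whole match)
        System.out.println(firstGroup(mention, "hi @glover", 1)); // glover  (first capturing group)

        // Group numbering for the ((A)(B(C))) example from the docs.
        Matcher m = Pattern.compile("((A)(B(C)))").matcher("ABC");
        if (m.matches()) {
            // groupCount() does not include group 0, hence the <= bound.
            for (int i = 0; i <= m.groupCount(); i++) {
                System.out.println(i + " " + m.group(i));
            }
        }
    }
}
```

The loop prints 0 ABC, 1 ABC, 2 A, 3 BC, 4 C, one per line, which lines up with the numbering quoted above.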
