Use Java Regex to find multiple matching words in a sentence

Question

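I have a sentence and a set of words, say: Mayweather, undefeated, and so on. I would like to:

Check whether the sentence contains any of the above words. (I want it to look for matching words only, basically ignoring full stops, commas and newlines.)
If it does, show a few words before and after each matching word, perhaps using String.format().

Here is my code, which seems to work but does not do exactly what I want: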

    String sentence = "Floyd Mayweather Jr is an American professional boxer " +
            "currently undefeated as a professional and is a five-division world champion, " +
            "having won ten world titles and the lineal championship in four different weight classes.";

    String newText = "";
    Pattern p = Pattern.compile("(Mayweather) .* (undefeated)");
    Matcher m = p.matcher(sentence);

    if (m.find()) {
        String group1 = m.group(1);
        String group2 = m.group(2);

        newText = String.format("%s ... %s" , group1, group2);
        System.out.println(newText);
    }

The output right now is:

Mayweather ... undefeated

What I want is something like this:

Floyd Mayweather Jr is an American ... currently undefeated as a professional ...

Could you tell me how to do this, or point me in the right direction? I am stuck.

Thanks in advance.

Answer 1

You can try something like the following.

Note: this is only a prototype, so don't copy and paste it directly.

    String str = "Floyd Mayweather Jr is an American professional boxer currently undefeated as a professional and is a five-division world champion, having won ten world titles and the lineal championship in four different weight classes.";
    // The cut points are hard-coded for this example; indexOf returns -1 if a word is missing.
    int firstIndex = str.indexOf("American");
    int secondIndex = str.indexOf("boxer");
    // Everything up to and including "American" (1st group)
    String group1 = str.substring(0, firstIndex + "American".length());
    // Everything from "boxer" to the end of the string (2nd group)
    String group2 = str.substring(secondIndex);
    String newText = String.format("%s ... %s", group1, group2);
    System.out.println(newText);

Output

Floyd Mayweather Jr is an American ... boxer currently undefeated as a professional and is a five-division world champion, having won ten world titles and the lineal championship in four different weight classes.
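
The prototype above hard-codes its cut points and would throw if either word were missing. As a rough sketch of the same indexOf/substring idea applied to the search word itself, the hypothetical helper snippetAround below (the name and the character radius are assumptions, not part of the answer) returns a fixed number of characters on each side of the first occurrence and an empty string when the word is absent:

```java
class IndexOfSnippet {
    // Hypothetical helper: returns up to `radius` characters on each side of
    // the first occurrence of `word`, or "" if `word` does not occur in `text`.
    static String snippetAround(String text, String word, int radius) {
        int start = text.indexOf(word);
        if (start < 0) {
            return ""; // indexOf signals "not found" with -1
        }
        // Clamp both ends so substring never leaves the string.
        int from = Math.max(0, start - radius);
        int to = Math.min(text.length(), start + word.length() + radius);
        return text.substring(from, to);
    }

    public static void main(String[] args) {
        String str = "Floyd Mayweather Jr is an American professional boxer "
                + "currently undefeated as a professional.";
        System.out.println(String.format("%s ... %s",
                snippetAround(str, "Mayweather", 6),
                snippetAround(str, "undefeated", 10)));
    }
}
```

Because the radius is counted in characters, the snippet can start or end in the middle of a word.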

Answer 2

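If you really want to solve this with a regex, your capturing groups need to match everything you want to output. Currently they match only your search terms: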

(Mayweather) .* (undefeated)
// "Mayweather", "undefeated"

You could try something like this (using only one group!), but it matches your entire example:

(.*Mayweather.*undefeated.*)
// -whole text-

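This can be changed to match two parts again, with up to 12 characters before and after each word (don't put spaces around the "match all" in the middle, and make it non-greedy!):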

(.{0,12}Mayweather.{0,12}).*?(.{0,12}undefeated.{0,12})
// "Floyd Mayweather Jr is an Am", "r currently undefeated as a profes"

This can be refined further to stop at word boundaries (the results need to be trimmed):

(\b.{0,12}Mayweather.{0,12}\b).*?(\b.{0,12}undefeated.{0,12}\b)
// "Floyd Mayweather Jr is an ", " currently undefeated as a "
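
For reference, here is a minimal sketch of how the word-boundary pattern above could be wired into Java. The class and method names (BoundaryContext, summarize) are made up for illustration; note that the backslashes have to be doubled inside a Java string literal, and that `.` does not match line breaks unless Pattern.DOTALL is set.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryContext {
    // Same pattern as above, with backslashes escaped for a Java string literal.
    static final Pattern CONTEXT = Pattern.compile(
            "(\\b.{0,12}Mayweather.{0,12}\\b).*?(\\b.{0,12}undefeated.{0,12}\\b)");

    // Hypothetical helper: joins the two trimmed groups, or returns ""
    // when the sentence does not contain both words in this order.
    static String summarize(String sentence) {
        Matcher m = CONTEXT.matcher(sentence);
        if (!m.find()) {
            return "";
        }
        return String.format("%s ... %s ...", m.group(1).trim(), m.group(2).trim());
    }

    public static void main(String[] args) {
        String sentence = "Floyd Mayweather Jr is an American professional boxer "
                + "currently undefeated as a professional and is a five-division world champion, "
                + "having won ten world titles and the lineal championship in four different weight classes.";
        System.out.println(summarize(sentence));
    }
}
```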

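Changing this to output a fixed number of words is left as a boring exercise for the reader.

Edit: Fixed the greediness of ".*" in the last two versions (added "?").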
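
One possible take on that exercise, offered as a sketch rather than a definitive solution: build the pattern from the word list and count whitespace-separated tokens instead of characters. The names WordContext and contexts, and the parameters before and after, are assumptions introduced here. Each keyword is escaped with Pattern.quote, and a loop over find() collects one snippet per match, so the sentence only has to contain any one of the words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class WordContext {
    // Hypothetical helper: for every occurrence of one of `words`, returns the
    // keyword with up to `before` preceding and `after` following tokens.
    // A "token" here is any run of non-whitespace, so punctuation stays attached.
    static List<String> contexts(String text, List<String> words, int before, int after) {
        List<String> snippets = new ArrayList<>();
        if (words.isEmpty()) {
            return snippets;
        }
        String alternation = words.stream()
                .map(Pattern::quote)              // treat each keyword literally
                .collect(Collectors.joining("|"));
        Pattern p = Pattern.compile(
                "(?:\\S+\\s+){0," + before + "}"      // up to `before` tokens
                + "\\b(?:" + alternation + ")\\b\\S*" // keyword plus trailing punctuation
                + "(?:\\s+\\S+){0," + after + "}");   // up to `after` tokens
        Matcher m = p.matcher(text);
        while (m.find()) {
            snippets.add(m.group());
        }
        return snippets;
    }

    public static void main(String[] args) {
        String sentence = "Floyd Mayweather Jr is an American professional boxer "
                + "currently undefeated as a professional and is a five-division world champion, "
                + "having won ten world titles and the lineal championship in four different weight classes.";
        List<String> found = contexts(sentence, List.of("Mayweather", "undefeated"), 1, 3);
        System.out.println(String.join(" ... ", found) + " ...");
    }
}
```

With one word before and three after, this prints "Floyd Mayweather Jr is an ... currently undefeated as a professional ...". Since `\s+` also matches line breaks, newlines between words are tolerated. Matches do not overlap, so a keyword that falls inside the trailing context of an earlier match is reported as part of that snippet and not separately.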

Answer 3

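The problem with your code is in how it uses groups. Regex groups give you back the pieces of the string that you were trying to identify in the first place.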

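group(0), also written as group(), is the entire match.

group(1) is your first capture group: the first instance of "Mayweather".

group(2) is your second capture group: "undefeated" (because ".*" is greedy, this is the last instance the pattern can reach).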

You can use the start(int group) and end(int group) methods to find the indices of the matches, and then do some basic string manipulation to build the new string.
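
To make the group numbering and the index methods concrete, here is a small sketch with a shortened sentence. The class GroupIndices and the helper span are hypothetical names used only for this illustration; end(group) is exclusive, so it can be passed straight to substring.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupIndices {
    static final Pattern P = Pattern.compile("(Mayweather) .* (undefeated)");

    // Hypothetical helper: returns "start-end" for the requested group,
    // or "" when the pattern does not match at all.
    static String span(String text, int group) {
        Matcher m = P.matcher(text);
        if (!m.find()) {
            return "";
        }
        return m.start(group) + "-" + m.end(group);
    }

    public static void main(String[] args) {
        String text = "Floyd Mayweather Jr is currently undefeated today";
        Matcher m = P.matcher(text);
        if (m.find()) {
            System.out.println("group(0): " + m.group());   // the entire match
            System.out.println("group(1): " + m.group(1) + " at " + span(text, 1));
            System.out.println("group(2): " + m.group(2) + " at " + span(text, 2));
        }
    }
}
```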

If you intend to use regex specifically, your solution looks like this:

     String sentence = "Floyd Mayweather Jr is an American professional boxer " +
             "currently undefeated as a professional and is a five-division world champion, " +
             "having won ten world titles and the lineal championship in four different weight classes.";

     // A StringBuilder can be modified in place, unlike a String, which is immutable.
     StringBuilder sb = new StringBuilder(sentence.length());

     Pattern p = Pattern.compile("(Mayweather) .* (undefeated)");
     Matcher m = p.matcher(sentence);

     if (m.find()) {
         // Start (inclusive) and end (exclusive) indices of each captured word
         int g1Start = m.start(1);
         int g1End = m.end(1);

         int g2Start = m.start(2);
         int g2End = m.end(2);

         sb.append(sentence.substring(0, g1Start));
         sb.append("...");
         sb.append(sentence.substring(g1End, g2Start));
         sb.append("...");
         // substring(g2End) runs to the end; using length() - 1 would drop the last character
         sb.append(sentence.substring(g2End));

I'm not sure whether you need a line break at the end, but if you do:

         sb.append("\r\n");

The rest is simple:

         newText = sb.toString();
         textView.setText(newText);
     }

Hope this helps :)

java

regex

string-matching