Use Java Regex to find multiple matching words in a sentence

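I have a sentence and a set of words, say: Mayweather, undefeated, etc. I want to:

  1. Check whether the sentence contains any of these words. (It should match whole words only, effectively ignoring full stops, commas and newlines.)
  2. If it does, display a few words before and after each matching word, perhaps using String.format().

Here is my code. It runs, but it doesn't do quite what I want: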

    String sentence = "Floyd Mayweather Jr is an American professional boxer " +
            "currently undefeated as a professional and is a five-division world champion, " +
            "having won ten world titles and the lineal championship in four different weight classes.";

    String newText = "";
    Pattern p = Pattern.compile("(Mayweather) .* (undefeated)");
    Matcher m = p.matcher(sentence);

    if (m.find()) {
        String group1 = m.group(1);
        String group2 = m.group(2);

        newText = String.format("%s ... %s" , group1, group2);
        System.out.println(newText);
    }

The output right now is:

Mayweather ... undefeated

What I want is something like this:

Floyd Mayweather Jr is an American ... currently undefeated as a professional ...

Can you tell me how to do this, or point me in the right direction? I'm stuck.

Thanks in advance.

You can try something like the following.

Note: this is only a prototype, so don't copy and paste it directly.

    String str = "Floyd Mayweather Jr is an American professional boxer currently undefeated as a professional and is a five-division world champion, having won ten world titles and the lineal championship in four different weight classes.";
    int firstIndex = str.indexOf("American");
    int secondIndex = str.indexOf("boxer");
    String group1 = str.substring(0, firstIndex + "American".length()); // gives you 1st group

    String group2 = str.substring(secondIndex);
    String newText = String.format("%s ... %s", group1, group2);
    System.out.println(newText);

Output:

Floyd Mayweather Jr is an American ... boxer currently undefeated as a professional and is a five-division world champion, having won ten world titles and the lineal championship in four different weight classes.

If you really want to solve this with regex, your capturing groups need to match everything you want to output. Currently they only match your search terms:

(Mayweather) .* (undefeated)
// "Mayweather", "undefeated"

You could try something like this (using only one group!), but it would match your whole example:

(.*Mayweather.*undefeated.*)
// -whole text-

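This can be changed to match two parts again, with up to 12 characters before and after each term (don't put spaces around the "match all" in the middle, and make it non-greedy!):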

(.{0,12}Mayweather.{0,12}).*?(.{0,12}undefeated.{0,12})
// "Floyd Mayweather Jr is an Am", "r currently undefeated as a profes"

This can be refined further to stop at word boundaries (the results need trimming):

(\b.{0,12}Mayweather.{0,12}\b).*?(\b.{0,12}undefeated.{0,12}\b)
// "Floyd Mayweather Jr is an ", " currently undefeated as a "
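For reference, here is a minimal Java sketch of the word-boundary version above. The class name `BoundarySnippetDemo` and the helper `snippet` are made up for illustration, and the sentence is shortened; note that the backslashes must be doubled inside a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundarySnippetDemo {
    // Up to 12 characters of context on each side of each term,
    // with each group starting and ending on a word boundary.
    static final Pattern P = Pattern.compile(
            "(\\b.{0,12}Mayweather.{0,12}\\b).*?(\\b.{0,12}undefeated.{0,12}\\b)");

    // Returns "<group 1> ... <group 2> ..." with both groups trimmed,
    // or an empty string if the pattern does not match.
    static String snippet(String sentence) {
        Matcher m = P.matcher(sentence);
        if (!m.find()) {
            return "";
        }
        return String.format("%s ... %s ...", m.group(1).trim(), m.group(2).trim());
    }

    public static void main(String[] args) {
        String sentence = "Floyd Mayweather Jr is an American professional boxer "
                + "currently undefeated as a professional and is a five-division world champion.";
        System.out.println(snippet(sentence));
        // prints: Floyd Mayweather Jr is an ... currently undefeated as a ...
    }
}
```

Keep in mind that `.` does not match line breaks by default, so a sentence spanning several lines would need `Pattern.DOTALL` or a different character class.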

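Changing this to output a fixed number of words is left as an exercise for the bored reader.

Edit: fixed the greediness of ".*" in the last two versions (added "?").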
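One possible take on the fixed-number-of-words variant, as a sketch rather than a definitive solution. It assumes that a "word" is any run of non-whitespace characters, so punctuation stays attached to its word. The names `WordContext`, `snippets`, `before` and `after` are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class WordContext {
    // Returns each keyword hit together with up to `before` words in front of it
    // and up to `after` words behind it, in the order the hits appear in the text.
    static List<String> snippets(String text, List<String> keywords, int before, int after) {
        // Quote the keywords so that regex metacharacters in them are taken literally.
        String alternatives = keywords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Pattern p = Pattern.compile(
                "(?<!\\S)(?:\\S+\\s+){0," + before + "}"       // start at a word start, take leading words
                + "\\S*\\b(?:" + alternatives + ")\\b\\S*"      // the keyword as a whole word, plus attached punctuation
                + "(?:\\s+\\S+){0," + after + "}");             // trailing words
        Matcher m = p.matcher(text);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String sentence = "Floyd Mayweather Jr is an American professional boxer "
                + "currently undefeated as a professional and is a five-division world champion.";
        List<String> hits = snippets(sentence, Arrays.asList("Mayweather", "undefeated"), 1, 3);
        System.out.println(String.join(" ... ", hits) + " ...");
        // prints: Floyd Mayweather Jr is an ... currently undefeated as a professional ...
    }
}
```

Because `find()` returns non-overlapping matches, a keyword that falls inside the trailing context of the previous hit is not reported separately. Matching is case-sensitive here; passing `Pattern.CASE_INSENSITIVE` to `Pattern.compile` would relax that.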

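The problem with your code is how it uses groups. A regex group gives you back the piece of the string that you were looking for in the first place.

group(0), also written as group(), is the entire match (here, everything from "Mayweather" through "undefeated"), not the whole sentence.

group(1) is your first group: the first instance of "Mayweather".

group(2) is your second group: an instance of "undefeated" (the last one that fits the pattern, since .* is greedy).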
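A small sketch to make the three groups concrete, using the pattern from the question on a shortened sentence. The class name `GroupDemo` and the helper `groups` are only illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns { group(0), group(1), group(2) } for the first match,
    // or an empty array if the pattern does not match.
    static String[] groups(String text) {
        Matcher m = Pattern.compile("(Mayweather) .* (undefeated)").matcher(text);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(0), m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        for (String g : groups("Floyd Mayweather Jr is currently undefeated as a professional.")) {
            System.out.println(g);
        }
        // prints:
        // Mayweather Jr is currently undefeated
        // Mayweather
        // undefeated
    }
}
```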

You can find the indices of each match with the start(int group) and end(int group) methods, and then do some basic string manipulation to build the new string.
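As a rough illustration of that idea, the sketch below uses the group indices to cut a window of characters around each matched word. The class name `GroupWindow`, the helper `around` and the `radius` parameter are assumptions made for this example; it cuts by character count, so words at the edges of the window may be truncated.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupWindow {
    // Returns the text of the given group plus up to `radius` characters
    // on each side, clamped to the bounds of the string.
    // Must be called after a successful m.find().
    static String around(String text, Matcher m, int group, int radius) {
        int from = Math.max(0, m.start(group) - radius);
        int to = Math.min(text.length(), m.end(group) + radius);
        return text.substring(from, to);
    }

    public static void main(String[] args) {
        String sentence = "Floyd Mayweather Jr is an American professional boxer "
                + "currently undefeated as a professional";
        Matcher m = Pattern.compile("(Mayweather) .* (undefeated)").matcher(sentence);
        if (m.find()) {
            System.out.println(around(sentence, m, 1, 6));   // prints: Floyd Mayweather Jr is
            System.out.println(around(sentence, m, 2, 10));  // prints: currently undefeated as a prof
        }
    }
}
```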

If you specifically want to use regex, a solution would look like this:

     String sentence = "Floyd Mayweather Jr is an American professional boxer " +
             "currently undefeated as a professional and is a five-division world champion, " +
             "having won ten world titles and the lineal championship in four different weight classes.";

     // A StringBuilder can be modified in place,
     // unlike a String, which is immutable.
     StringBuilder sb = new StringBuilder(sentence.length());

     Pattern p = Pattern.compile("(Mayweather) .* (undefeated)");
     Matcher m = p.matcher(sentence);

     if (m.find()) {
         int g1Start = m.start(1);
         int g1End = m.end(1);

         int g2Start = m.start(2);
         int g2End = m.end(2);

         // Copy the text around the two matched words, putting "..." where each word was.
         sb.append(sentence.substring(0, g1Start));
         sb.append("...");
         sb.append(sentence.substring(g1End, g2Start));
         sb.append("...");
         // The one-argument substring runs to the end of the sentence.
         sb.append(sentence.substring(g2End));

I'm not sure whether you need a line break at the end, but if you do:

         sb.append("\r\n");

The rest is then simple:

         newText = sb.toString();
         textView.setText(newText);
     }

Hope this helps :)