Regex to match quote with minimum number of words

I have the following text:

Attorney General William Barr said the volume of information compromised was “staggering” and the largest breach in U.S. history.“This theft not only caused significant financial damage to Equifax but invaded the privacy of many, millions of Americans and imposed substantial costs and burdens on them as they had to take measures to protect themselves from identity theft,” said Mr. Barr.

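I want to match the text inside the quotation marks, but a quotation must contain at least 5 words; otherwise it should be ignored.

Currently, I am using the following regex: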

(?<=[\“|\"])[A-Za-z0-9\.\-][A-Za-z\s,:\’]+(?=[\”|\"])

However, this also matches the one-word quotation “staggering”, which should be ignored.

I realize I could do this by repeating this part of the regex 5 times:

[A-Za-z\s,:\’]+[A-Za-z\s,:\’]+[A-Za-z\s,:\’]+[A-Za-z\s,:\’]+[A-Za-z\s,:\’]+

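However, I wonder whether there is a shorter, more concise way to achieve this. Perhaps by forcing the \s inside the [] to occur at least 5 times?

Thanks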

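You need to "unroll" the character class by taking the whitespace-matching pattern out of it, and use a pattern like [<chars>]+(?:\s+[<chars>]+){4,}. Note that you should not use lookarounds here: " can act as both a leading and a trailing marker, which may result in unwanted matches. Use a capturing group instead and access its value via matcher.group(1).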
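To illustrate the lookaround problem, here is a minimal sketch. It uses a deliberately simplified pattern (anything between straight quotes) rather than the word-counting pattern, and the class name LookaroundPitfall, the findAll helper and the sample string are all made up for the illustration. Since lookarounds do not consume the quotes, the closing " of one quotation is reused as the opening " of the next match, so the text between two quotations is reported too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundPitfall {
    // Hypothetical helper: collects the given group of every match.
    static List<String> findAll(String regex, String text, int group) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group(group));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "\"one\" two \"three\"";

        // Lookarounds leave the quotes unconsumed, so the quote that closes
        // "one" also satisfies the lookbehind of the next match.
        System.out.println(findAll("(?<=\")[^\"]+(?=\")", text, 0));
        // prints [one,  two , three]

        // Consuming both quotes and capturing the inside avoids that.
        System.out.println(findAll("\"([^\"]+)\"", text, 1));
        // prints [one, three]
    }
}
```

With curly quotes alone the opening and closing marks differ, so the issue only shows up once the straight " is allowed on both sides, as it is in the pattern from the question.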

You can use

String regex = "[“\"]([A-Za-z0-9.-][A-Za-z,:’]*(?:\\s+[A-Za-z0-9.-][A-Za-z,:’]*){4,})[”\"]";

See the regex demo.

Then, grab only the Group 1 values:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String line = "Attorney General William Barr said the volume of information compromised was “staggering” and the largest breach in U.S. history.“This theft not only caused significant financial damage to Equifax but invaded the privacy of many, millions of Americans and imposed substantial costs and burdens on them as they had to take measures to protect themselves from identity theft,” said Mr. Barr.";
// The quote characters stay outside the capturing group; \\s+ separates the words
String regex = "[“\"]([A-Za-z0-9.-][A-Za-z,:’]*(?:\\s+[A-Za-z0-9.-][A-Za-z,:’]*){4,})[”\"]";
Matcher m = Pattern.compile(regex).matcher(line);
List<String> res = new ArrayList<>();
while (m.find()) {
    res.add(m.group(1)); // keep only the text between the quotes
}
System.out.println(res);

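See the online Java demo.

Pattern details

  • [“"] - a “ or " character
  • ([A-Za-z0-9.-][A-Za-z,:’]*(?:\s+[A-Za-z0-9.-][A-Za-z,:’]*){4,}) - Group 1:
    • [A-Za-z0-9.-][A-Za-z,:’]* - an ASCII alphanumeric character, . or -, followed by zero or more ASCII letters, commas, colons or ’ characters
    • (?:\s+[A-Za-z0-9.-][A-Za-z,:’]*){4,} - four or more occurrences of:
      • \s+ - one or more whitespace characters
      • [A-Za-z0-9.-][A-Za-z,:’]* - an ASCII alphanumeric character, . or -, followed by zero or more ASCII letters, commas, colons or ’ characters
  • [”"] - a ” or " character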
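If the minimum word count needs to vary, the same pattern can be assembled from its parts. The following is only a sketch: QuoteExtractor and quotesWithAtLeast are hypothetical names, and the sample sentence is invented. The first word is matched once and each further word is matched as "whitespace + word", so a minimum of N words becomes a {N-1,} quantifier. The curly quotes are written as Unicode escapes here so the file compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteExtractor {
    // One "word" as defined by the pattern above. U+2019 is the right single
    // quotation mark, written as an escape to keep this file ASCII-only.
    private static final String WORD = "[A-Za-z0-9.-][A-Za-z,:\u2019]*";

    // Hypothetical helper: quoted passages containing at least minWords words.
    static List<String> quotesWithAtLeast(String text, int minWords) {
        if (minWords < 1) {
            throw new IllegalArgumentException("minWords must be at least 1");
        }
        // U+201C and U+201D are the curly double quotes.
        String regex = "[\u201C\"](" + WORD
                + "(?:\\s+" + WORD + "){" + (minWords - 1) + ",})[\u201D\"]";
        Matcher m = Pattern.compile(regex).matcher(text);
        List<String> res = new ArrayList<>();
        while (m.find()) {
            res.add(m.group(1)); // text between the quotes only
        }
        return res;
    }

    public static void main(String[] args) {
        String text = "It was \u201Cstaggering\u201D and "
                + "\u201Cthe largest breach in history\u201D.";
        System.out.println(quotesWithAtLeast(text, 5));
        // prints [the largest breach in history]
    }
}
```

Calling quotesWithAtLeast(text, 1) on the same input would also return the one-word quotation, because the quantifier then becomes {0,}.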

You need to use a regex that fits your case.

The snippet below matches quoted text that is at least 5 words long:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Backslashes must be doubled inside a Java string literal
    Pattern pattern = Pattern.compile("“((\\b\\w+\\b)+.?( *)){5,}”", Pattern.DOTALL);

    String input = "Attorney General William Barr said the volume of " +
        "information compromised was “staggering” and the largest breach in " +
        "U.S. history.“This theft not only caused significant financial " +
        "damage to Equifax but invaded the privacy of many, millions of " +
        "Americans and imposed substantial costs and burdens on them as " +
        "they had to take measures to protect themselves from identity theft,” said Mr. Barr.";

    Matcher m = pattern.matcher(input);

    while (m.find()) {
      String s = m.group(); // whole match, including the surrounding “ and ”
      System.out.println(s);
    }

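Note: the source contains the curly quote characters “ and ”, so the compiler has to read the file with the right encoding. If the file is saved as UTF-8 and that is not your platform's default encoding, use javac -encoding utf8 TheClass.java instead of javac TheClass.java!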
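If passing the encoding flag is inconvenient, one alternative is to keep the source file ASCII-only and write the curly quotes as Unicode escapes (\u201C for the opening quote, \u201D for the closing one). This is a sketch of that idea applied to the pattern above; the class name EscapedQuotes, the firstQuote helper and the sample sentence are invented for the illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedQuotes {
    // Same pattern as above, with U+201C and U+201D written as escapes so
    // the file compiles the same way under any source encoding.
    static final Pattern QUOTE = Pattern.compile(
            "\u201C((\\b\\w+\\b)+.?( *)){5,}\u201D", Pattern.DOTALL);

    // Hypothetical helper: first quotation of five or more words, quotes included.
    static Optional<String> firstQuote(String input) {
        Matcher m = QUOTE.matcher(input);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String input = "He called it \u201Cstaggering\u201D and added "
                + "\u201Cthis is the largest breach ever\u201D.";
        // The one-word quotation is skipped; the six-word one is returned.
        System.out.println(firstQuote(input).orElse("no match"));
    }
}
```

Whether the curly quotes display correctly on the console still depends on the terminal's encoding, but the match itself no longer depends on how javac reads the file.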