Java Pattern.compile ignoring escaped double quotes (\")

I'm having trouble working out a pattern that ignores escaped quotes. I want this:

    "10\" 2 Topping Pizza, Pasta, or Sandwich for  each. Valid until 2pm. Carryout only.","blah blah" 

to be matched as:

   1> "10\" 2 Topping Pizza, Pasta, or Sandwich for  each. Valid until 2pm. Carryout only."
   2> "blah blah" 

I've been trying this:

    Pattern pattern = Pattern.compile("\"[^\"]*\"");
    Matcher matcher = pattern.matcher(filteredCoupons);

I get

   1> "10\"
   2> "," 

The regex you are looking for is

"[^"\\]*(?:\\.[^"\\]*)*"


In Java,

String pattern = "\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"";
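
A minimal sketch of how that pattern might be used. The class name UnrolledQuoteMatch, the helper findQuoted, and the shortened sample string are made up for illustration and are not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnrolledQuoteMatch {
    // Regex as the engine sees it: "[^"\\]*(?:\\.[^"\\]*)*"
    // Every regex backslash is doubled again for the Java string literal.
    static final Pattern QUOTED =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"");

    // Hypothetical helper: returns every quoted field found in the input.
    static List<String> findQuoted(String input) {
        List<String> fields = new ArrayList<>();
        Matcher matcher = QUOTED.matcher(input);
        while (matcher.find()) {
            fields.add(matcher.group());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the coupon text: "10\" Pizza","blah blah"
        String input = "\"10\\\" Pizza\",\"blah blah\"";
        findQuoted(input).forEach(System.out::println);
    }
}
```

The pattern consumes runs of characters that are neither a quote nor a backslash, and handles each backslash together with the character that follows it, so an escaped quote never ends the field.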

It looks like your regex needs to accept either non-quote characters or quotes preceded by \. In that case, try

Pattern pattern = Pattern.compile("\"(\\\\.|[^\"])*\"");

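This part of the regex, \\\\.|[^\"] (written as a Java string literal; the regex engine sees \\.|[^"]), tries to find

  • \\\\. - any escaped character (a backslash followed by any character),
  • (| means "or") [^\"] - any non-quote character

I placed \\\\. before [^\"] to prevent the \ from being matched by [^\"].

In other words, for text like foo\"bar" and the regex \\\\.|[^\"], you get these matches:

foo\"bar"
^^^-matched by [^\"]

foo\"bar"
   ^^-matched by \\\\.

foo\"bar"
     ^^^-matched by [^\"]

foo\"bar"
        ^-can't be matched by anything: there is no \ before it,
          and it is not a non-quote character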
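
To illustrate why the order of the two alternatives matters, here is a small sketch that runs both orderings against the same text. The class name AlternationOrder and the helper firstMatch are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrder {
    // Hypothetical helper: returns the first match of regex in input, or null.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // The text "foo\"bar" including its surrounding quotes.
        String input = "\"foo\\\"bar\"";

        // Escaped character tried first: the backslash and the quote are consumed as a pair.
        System.out.println(firstMatch("\"(\\\\.|[^\"])*\"", input));

        // Non-quote tried first: [^"] consumes the backslash on its own,
        // so the escaped quote is taken as the closing quote.
        System.out.println(firstMatch("\"([^\"]|\\\\.)*\"", input));
    }
}
```

With the escaped-character alternative first, the whole "foo\"bar" is matched; with the alternatives swapped, the match ends at "foo\".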

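Demo:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The Java literal below holds the text "10\" 2 Topping ... only.","blah blah"
String filteredCoupons = "\"10\\\" 2 Topping Pizza, Pasta, or Sandwich for  each. Valid until 2pm. Carryout only.\",\"blah blah\"";
Pattern pattern = Pattern.compile("\"(\\\\.|[^\"])*\"");
Matcher matcher = pattern.matcher(filteredCoupons);
while (matcher.find()) {
    System.out.println(matcher.group());
}

Output: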

"10\" 2 Topping Pizza, Pasta, or Sandwich for  each. Valid until 2pm. Carryout only."
"blah blah"

A negative lookbehind can also be used:

(?s)".*?"(?<!\\.)

As a Java string:

"(?s)\".*?\"(?<!\\\\.)"

See test at regex101; test at regexplanet (click on "Java")

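  • After meeting a ", it looks behind, skipping one character (the " itself), to check that no backslash precedes it
  • Similar to ".*?(?<!\\)", but looking behind only after meeting the " performs better
  • The (?s) flag makes the dot match newlines as well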
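
A minimal sketch of the lookbehind variant in Java. The class name LookbehindQuoteMatch, the helper findQuoted, and the sample string (which contains a newline to exercise the (?s) flag) are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindQuoteMatch {
    // Regex as the engine sees it: (?s)".*?"(?<!\\.)
    static final Pattern QUOTED = Pattern.compile("(?s)\".*?\"(?<!\\\\.)");

    // Hypothetical helper: returns every quoted field found in the input.
    static List<String> findQuoted(String input) {
        List<String> fields = new ArrayList<>();
        Matcher matcher = QUOTED.matcher(input);
        while (matcher.find()) {
            fields.add(matcher.group());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Text: "10\" Pizza","blah<newline>blah"
        String input = "\"10\\\" Pizza\",\"blah\nblah\"";
        findQuoted(input).forEach(System.out::println);
    }
}
```

The lazy .*? stops at each candidate closing quote, and the lookbehind rejects the candidate when the character just before it is a backslash, so the match extends to the next quote.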

Out of interest, I benchmarked the different versions with your sample string at regexhero.net (thanks @stribizhev for this link!). I was unsure whether the step counter of regex101 is accurate here.

The benchmark uses non-capturing groups only. Interestingly, "(?:\\.|[^"])*" performs almost twice as well as the capturing-group version "(\\.|[^"])*".
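
For anyone who wants to repeat the comparison with java.util.regex, here is a rough timing sketch. The class name GroupTimingSketch, its helpers, the shortened sample string, and the round counts are all assumptions. The numbers above came from regexhero.net, so timings from this sketch need not show the same ratio, and a naive System.nanoTime loop like this is only indicative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupTimingSketch {
    static final Pattern CAPTURING = Pattern.compile("\"(\\\\.|[^\"])*\"");
    static final Pattern NON_CAPTURING = Pattern.compile("\"(?:\\\\.|[^\"])*\"");

    // Counts the matches of the pattern in the input; this is the work being timed.
    static int countMatches(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // Returns the elapsed nanoseconds for the given number of passes over the input.
    static long time(Pattern pattern, String input, int rounds) {
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            countMatches(pattern, input);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the coupon text: "10\" Pizza","blah blah"
        String input = "\"10\\\" Pizza\",\"blah blah\"";

        // Untimed warm-up passes so the JIT has a chance to compile the hot code.
        time(CAPTURING, input, 50_000);
        time(NON_CAPTURING, input, 50_000);

        System.out.println("capturing:     " + time(CAPTURING, input, 200_000) + " ns");
        System.out.println("non-capturing: " + time(NON_CAPTURING, input, 200_000) + " ns");
    }
}
```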