Regex match every string inside double quotes and include escaped quotation marks

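There are already plenty of similar questions, but none of them applies to my case. I have a string that contains several substrings in double quotes, and these substrings can contain escaped double quotes.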

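For example, for the string 'And then, "this is some sample text with quotes and \"escaped quotes\" inside". Not that we need more, but... "here is \"another\" one". Just in case.', the expected result is an array with two elements:

"this is some sample text with quotes and \"escaped quotes\" inside"
"here is \"another\" one"

The regular expression /"(?:\\"|[^"])*"/g works as expected on regex101; however, when I use String#match(), the result is different. See the snippet below: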

let str = 'And then, "this is some sample text with quotes and \"escaped quotes\" inside". Not that we need more, but... "here is \"another\" one". Just in case.'
let regex = /"(?:\\"|[^"])*"/g

console.log(str.match(regex))

I get four matches instead of two, and they don't even include the text inside the escaped quotes.

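MDN mentions that if the g flag is used, all results matching the complete regular expression are returned, but capturing groups are not. If I want the capture groups while the global flag is set, I need to use RegExp.exec(). I tried that, and the result is the same: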

let str = 'And then, "this is some sample text with quotes and \"escaped quotes\" inside". Not that we need more, but... "here is \"another\" one". Just in case.'
let regex = /"(?:\\"|[^"])*"/g
let temp
let matches = []

while (temp = regex.exec(str))
  matches.push(temp[0])

console.log(matches)

How can I get an array containing just these two matches?

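The regex doesn't work as expected because a single backslash is an escape character in a JavaScript string literal, so \" in your source code produces a plain " with no backslash in the actual string. You need to escape the backslashes in your text: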

// Single backslashes: \" is a string-literal escape, so str contains plain " characters only.
let str = 'And then, "this is some sample text with quotes and \"escaped quotes\" inside". Not that we need more, but... "here is \"another\" one". Just in case.';
let regex = /"(?:\\"|[^"])*"/g

console.log(str);
console.log(str.match(regex)) // four matches

// Doubled backslashes: \\" leaves a literal backslash in front of each inner quote.
str = 'And then, "this is some sample text with quotes and \\"escaped quotes\\" inside". Not that we need more, but... "here is \\"another\\" one". Just in case.';

console.log(str);
console.log(str.match(regex)) // two matches
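
To see the difference more directly, it may help to inspect what each literal actually holds. This is only a small sketch; the shortened sample strings and the names `single` and `doubled` are made up for illustration and are not from the question.

```javascript
// Illustrative names and shortened sample text (assumptions, not from the question).
const single = '"a \"b\" c"';    // \" is a string-literal escape for a plain quote
const doubled = '"a \\"b\\" c"'; // \\ is a string-literal escape for one backslash

console.log(single);  // "a "b" c"
console.log(doubled); // "a \"b\" c"

// Only the second string really contains backslashes.
console.log(single.includes('\\'));  // false
console.log(doubled.includes('\\')); // true

const regex = /"(?:\\"|[^"])*"/g;
console.log(single.match(regex).length);  // 2 (the quoted text is split at the inner quotes)
console.log(doubled.match(regex).length); // 1 (the whole quoted text, escapes included)
```

Text read from a file, an input field, or a network response is not affected by this, since literal escaping only applies to strings written in source code.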

Another option is a more optimized regex that avoids the | operator:

const str = String.raw`And then, "this is some sample text with quotes and \"escaped quotes\" inside". Not that we need more, but... "here is \"another\" one". Just in case.`
const regex = /"[^"\\]*(?:\\[\s\S][^"\\]*)*"/g
console.log(str.match(regex))

With String.raw, there is no need to double the backslashes in front of the quotes.
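
As a quick check of that claim, the sketch below compares a String.raw template with the equivalent ordinary literal. The names `raw` and `escaped` and the sample text are illustrative assumptions.

```javascript
// Illustrative names and sample text (assumptions, not from the answer).
const raw = String.raw`say \"hi\"`; // backslashes are kept exactly as typed
const escaped = 'say \\"hi\\"';     // ordinary literal: each backslash must be doubled

console.log(raw);             // say \"hi\"
console.log(raw === escaped); // true

// Substitutions are still evaluated inside String.raw; only escapes are left alone.
console.log(String.raw`sum=${1 + 1}\n`); // sum=2\n  (a backslash and an n, not a newline)
```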

See the regex proof. By the way: 28 steps vs. 267 steps.
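
Apart from the step count, the two patterns can also give different results when an escaped backslash sits right before a closing quote. The sketch below is only an illustration under assumptions: the helper name `extractQuoted` and the sample text are hypothetical, and the comparison covers this one case rather than every possible input.

```javascript
// Hypothetical helper; the pattern is the one from the snippet above.
function extractQuoted(text) {
  const pattern = /"[^"\\]*(?:\\[\s\S][^"\\]*)*"/g;
  // match() returns null when nothing matches, so fall back to an empty array.
  return text.match(pattern) || [];
}

// "path\\" ends with an escaped backslash, so its final quote is a real closing quote.
const sample = String.raw`He said "a \"b\" c", then "path\\" and "d".`;

console.log(extractQuoted(sample));
// [ '"a \\"b\\" c"', '"path\\\\"', '"d"' ]  (three strings, as intended)

// The alternation-based pattern reads the second backslash plus the quote as an escaped quote.
const alternation = /"(?:\\"|[^"])*"/g;
console.log(sample.match(alternation).length); // 2 (the second match runs on past "path\\")
```

Because \\[\s\S] always consumes a backslash together with the character after it, a backslash that is itself escaped cannot be mistaken for the start of an escaped quote.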

Explanation

--------------------------------------------------------------------------------
  "                        '"'
--------------------------------------------------------------------------------
  [^"\\]*                  any character except: '"', '\' (0 or more
                           times (matching the most amount possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    \\                       '\'
--------------------------------------------------------------------------------
    [\s\S]                   any character of: whitespace (\n, \r,
                             \t, \f, and " "), non-whitespace (all
                             but \n, \r, \t, \f, and " ")
--------------------------------------------------------------------------------
    [^"\\]*                  any character except: '"', '\' (0 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )*                       end of grouping
--------------------------------------------------------------------------------
  "                        '"'