Why do LF and CRLF behave differently with the /^\s*$/gm regex?

Question

I have been seeing this issue on Windows. When I try to clear any whitespace on each line on Unix:

const input =
`===

HELLO

WOLRD

===`
console.log(input.replace(/^\s+$/gm, ''))

This produces what I expect:

===

HELLO

WOLRD

===

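That is, if there were spaces on the blank lines, they would be removed. On Windows, on the other hand, the regex removes the blank lines entirely. To illustrate: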

const input =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, '\r\n')
console.log(input.replace(/^\s+$/gm, ''))

(Template literals in JS always produce only \n, so I had to replace it with \r\n to simulate Windows. The ? after \r is just for those who don't believe it.) The result:

===
HELLO
WOLRD
===

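The whole line is gone! But my regex has ^ and $ and the m flag is set, so it is something like /^-to-$/m. What is the difference between \n and \r\n that makes it produce different results?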

When I add some logging

console.log(input.replace(/^\s*$/gm, (m) => {
  console.log('matched')
  return ''
}))

with \r\n I see

matched
matched
matched
matched
matched
matched
===
HELLO
WOLRD
===

and with only \n

matched
matched
matched
===

HELLO

WOLRD

===

Answer 1

TL;DR: a pattern that includes whitespace and line breaks will also match characters that are part of a \r\n sequence, if you let it.

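First of all, let's actually examine which characters are there and which are not after the replacement. Starting with a string that only uses line feeds: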

const inputLF =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, "\n");

console.log('------------ INPUT ')
console.log(inputLF);
console.log('------------')

debugPrint(inputLF, 2);
debugPrint(inputLF, 3);
debugPrint(inputLF, 4);
debugPrint(inputLF, 5);

const replaceLF = inputLF.replace(/^\s+$/gm, '');

console.log('------------ REPLACEMENT')
console.log(replaceLF);
console.log('------------')

debugPrint(replaceLF, 2);
debugPrint(replaceLF, 3);
debugPrint(replaceLF, 4);
debugPrint(replaceLF, 5);

console.log(`charcode ${replaceLF.charCodeAt(2)} : ${replaceLF.charAt(2)}`);
console.log(`charcode ${replaceLF.charCodeAt(3)} : ${replaceLF.charAt(3)}`);
console.log(`charcode ${replaceLF.charCodeAt(4)} : ${replaceLF.charAt(4)}`);
console.log(`charcode ${replaceLF.charCodeAt(5)} : ${replaceLF.charAt(5)}`);

console.log('------------')
console.log('inputLF === replaceLF :', inputLF === replaceLF)

function debugPrint(str, charIndex) {
  console.log(`index: ${charIndex}
   charcode: ${str.charCodeAt(charIndex)}
   character: ${str.charAt(charIndex)}`
 );
}

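Each line ends with char code 10, which is the line feed (LF) character, represented as \n in a string literal. Before and after the replacement the two strings are the same: not only do they look the same, they are actually equal to each other, so the replacement did nothing.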

Now let's examine the other case:

const inputCRLF =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, "\r\n")
console.log('------------ INPUT ')
console.log(inputCRLF);
console.log('------------')

debugPrint(inputCRLF, 2);
debugPrint(inputCRLF, 3);
debugPrint(inputCRLF, 4);
debugPrint(inputCRLF, 5);
debugPrint(inputCRLF, 6);
debugPrint(inputCRLF, 7);

const replaceCRLF = inputCRLF.replace(/^\s+$/gm, '');

console.log('------------ REPLACEMENT')
console.log(replaceCRLF);
console.log('------------')

debugPrint(replaceCRLF, 2);
debugPrint(replaceCRLF, 3);
debugPrint(replaceCRLF, 4);
debugPrint(replaceCRLF, 5);

function debugPrint(str, charIndex) {
  console.log(`index: ${charIndex}
   charcode: ${str.charCodeAt(charIndex)}
   character: ${str.charAt(charIndex)}`
 );
}

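This time each line ends with char code 13, which is the carriage return (CR) character, represented as \r in a string literal, and then the LF follows it. After the replacement, instead of the sequence =\r\n\r\nH there is just =\r\nH. Let's see why.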
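To see the same thing without indexing individual characters, the control characters can be made visible by escaping them. This is only a minimal sketch on a shortened, hypothetical input (`sample` and `stripped` are illustrative names, not part of the question's code):

```javascript
// JSON.stringify escapes control characters, so \r and \n are printed
// as visible text instead of moving the cursor.
const sample = "===\r\n\r\nHELLO"; // shortened stand-in for the CRLF input
const stripped = sample.replace(/^\s+$/gm, "");

console.log(JSON.stringify(sample));   // "===\r\n\r\nHELLO"
console.log(JSON.stringify(stripped)); // "===\r\nHELLO"
```

One \n and one \r have disappeared, which is exactly the blank line that went missing in the printed output.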

Here is what MDN says about the metacharacter ^:

Matches the beginning of input. If the multiline flag is set to true, also matches immediately after a line break character.

And here is what MDN says about the metacharacter $:

Matches the end of input. If the multiline flag is set to true, also matches immediately before a line break character.

So they match after and before a line break character. By that, MDN means an LF or a CR. This can be seen if we test strings that contain different line breaks:

const stringLF = "hello\nworld";
const stringCRLF = "hello\r\nworld";

const regexStart = /^\s/m;
const regexEnd = /\s$/m;

console.log(regexStart.exec(stringLF));
console.log(regexStart.exec(stringCRLF));

console.log(regexEnd.exec(stringLF));
console.log(regexEnd.exec(stringCRLF));

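If we try to match whitespace next to a line break, nothing matches when there is only an LF, but with CRLF there is a match: ^\s matches the LF and \s$ matches the CR. So, in that case the two patterns match here:

"hello\r\nworld"
        ^^ what `^\s` matches

"hello\r\nworld"
      ^^ what `\s$` matches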

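So both ^ and $ recognise either character of the CRLF sequence as a line break. This makes a difference when you do a search and replace. Since your regex specifies ^\s+$, a line that consists entirely of \r\n does match. But it matches for a non-obvious reason: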

const re = /^\s+$/m;

const stringLF = "hello\n\nworld";
const stringCRLF = "hello\r\n\r\nworld";

console.log(re.exec(stringLF));
console.log(re.exec(stringCRLF));

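So the regex does not match an \r\n but rather the \n\r (two whitespace characters) that sits between two other line break characters. That is because + is greedy and will consume as much of the character sequence as it can get away with. Here is what the regex engine will try, somewhat simplified for brevity: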

input = "hello\r\n\r\nworld"
regex = /^\s+$/m

(Brackets show how far the engine has looked. The leading \r is the line break that `^` anchors after; it is not part of the match.)

Step 1
hello[\r]\n\r\nworld
    `^` matches after the `\r`, symbol satisfied -> continue with next symbol in regex

Step 2
hello[\r\n]\r\nworld
    matches `^\s+` -> continue matching to satisfy `+` quantifier

Step 3
hello[\r\n\r]\nworld
    matches `^\s+` -> continue matching to satisfy `+` quantifier

Step 4
hello[\r\n\r\n]world
    matches `^\s+` -> continue matching to satisfy `+` quantifier

Step 5
hello[\r\n\r\nw]orld
    `w` does not match `\s` -> backtrack

Step 6
hello[\r\n\r\n]world
    matches `^\s+`, quantifier satisfied -> continue to next symbol in regex

Step 7
hello[\r\n\r\nw]orld
    `$` does not match before `w` -> backtrack, `\s+` gives back the last `\n`

Step 8
hello[\r\n\r]\nworld
    `$` matches before the final `\n`, so `^\s+$` is satisfied with `\n\r` -> finish
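The outcome of that trace can be checked directly by listing every match together with its index. The following is a rough sketch; `listMatches` and `traced` are hypothetical names introduced only for illustration:

```javascript
// Hypothetical helper: collect each match with its start index and an
// escaped view of the matched text so that \r and \n stay visible.
function listMatches(regex, text) {
  // String.prototype.matchAll requires a regex with the g flag
  return Array.from(text.matchAll(regex), (m) => ({
    index: m.index,
    text: JSON.stringify(m[0]),
  }));
}

const traced = "hello\r\n\r\nworld";

// One match: the "\n\r" pair starting at index 6
console.log(listMatches(/^\s+$/gm, traced));

// Two matches: the same "\n\r" pair, then an empty match at index 8
console.log(listMatches(/^\s*$/gm, traced));
```

With \s* there are two matches per blank line, which lines up with the six "matched" lines logged in the question for the three blank lines of the CRLF input.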

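Lastly, there is something hidden here: it matters that you are matching whitespace. \s behaves differently from most other symbols in that it explicitly matches line break characters, whereas . will not: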

Matches any single character except line terminators

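So, if you specify \s$, this will match the CR in \r\n, because the regex engine is forced to find a match for both \s and $, and therefore it finds the \r before the \n. However, this will not happen for many other patterns, since $ will usually be satisfied when it is before the CR (or at the end of the string).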

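The same goes for ^\s: it explicitly looks for a whitespace character after a line break, which is satisfied by the LF in a CRLF. If you are not looking for that, then ^ will happily match after the LF: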

const stringLF = "hello\nworld";
const stringCRLF = "hello\r\nworld";

const regexStartAll = /^./mg;
const regexEndAll = /.$/gm;

console.log(stringLF.match(regexStartAll));
console.log(stringCRLF.match(regexStartAll));

console.log(stringLF.match(regexEndAll));
console.log(stringCRLF.match(regexEndAll));

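So, all of this means that ^\s+$ has some unintuitive behaviour, but it is perfectly consistent once you understand that the regex engine matches exactly what you tell it to.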
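If the goal is only to blank out lines that contain nothing but spaces or tabs, one possible way around this is to stop the pattern from ever consuming a CR or LF. The sketch below is not from the original answer; it assumes the input uses only \n or \r\n line endings, and `stripBlankLines` is a hypothetical name:

```javascript
// Sketch: empty out whitespace-only lines while leaving every line break intact.
function stripBlankLines(text) {
  // [^\S\r\n] means "whitespace that is neither CR nor LF", so the match
  // can never swallow half of a \r\n pair.
  return text.replace(/^[^\S\r\n]+$/gm, "");
}

const withLF = "===\n  \nHELLO";
const withCRLF = "===\r\n  \r\nHELLO";

console.log(JSON.stringify(stripBlankLines(withLF)));   // "===\n\nHELLO"
console.log(JSON.stringify(stripBlankLines(withCRLF))); // "===\r\n\r\nHELLO"
```

Because line breaks are excluded from the character class, LF and CRLF input are treated the same way: the spaces go, the blank line stays.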

javascript

regex

newline

carriage-return

linefeed