How does this pattern match a hyphen without escaping it?
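After fiddling with regex101 for a few minutes, I realized that ] does not need to be escaped if it immediately follows [.
In regex101, the pattern []-a-z] is described as: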
/[]-a-z]/
[]-a-z] match a single character present in the list below
]-a a single character in the range between ] and a (case sensitive)
-z a single character in the list -z literally (case sensitive)
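But I always thought that if - has to be matched literally without escaping, it should either go at the beginning or at the end.
So why is my pattern not flagged as an error? And why is -z described as matching a single character in the list -z literally?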
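The regex does not fail because the first - denotes a range here, from ] to a. A ] does not have to be escaped when it is the first character inside a character class, so it is treated as a literal. The range is valid because its start does not come after its end: ] has ASCII code 93, and a has ASCII code 97.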
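The code-point argument is easy to check. The sketch below uses Python's re module, which is a different flavor from PCRE but happens to parse this particular class the same way; take it as an illustration rather than a statement about PCRE internals.

```python
import re

# Code points that bound the range: ']' must not come after 'a'.
print(ord("]"), ord("a"))  # 93 97

pattern = re.compile(r"[]-a-z]")

# Collect every ASCII character the class accepts.
matched = [chr(c) for c in range(128) if pattern.fullmatch(chr(c))]
print(matched)  # ['-', ']', '^', '_', '`', 'a', 'z']
```

The class therefore holds the five characters from ] to a, plus a literal hyphen and z.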
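Edit:
Regular expressions have a general property: they are parsed from left to right. The range is therefore formed from the characters on either side of the first hyphen. The second hyphen comes right after the range's end character, and that character cannot also serve as the start of a new range because it is already "occupied". So the regex engine can only parse the second hyphen as a literal.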
The minus (hyphen) character can be used to specify a range of characters in a character class. For example, [d-m] matches any letter between d and m, inclusive. If a minus character is required in a class, it must be escaped with a backslash or appear in a position where it cannot be interpreted as indicating a range, typically as the first or last character in the class, or immediately after a range. For example, [b-d-z] matches letters in the range b to d, a hyphen character, or z.
Hyphens at other positions in character classes where they can't form
a range may be interpreted as literals or as errors. Regex flavors are
quite inconsistent about this.
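So here the second - cannot form a range, because the preceding token is a range rather than a single character, and it is therefore interpreted as a literal -.
Let's break it down: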
[]-a-z]
 ^^ ^
 || +---- 3
 |+------ 2
 +------- 1
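1 is a literal ] because it is the first character of the class, and [] is an invalid character class in PCRE.
2 is therefore the second character in the class, and this hyphen introduces a range between ] and a.
The next hyphen, 3, is treated literally because the preceding token, a, is the end of the previous range. Another range cannot be introduced at this point. In PCRE, a - is treated literally if it sits in a position where it cannot introduce a range, or if it is escaped. We usually put literal hyphens at the start or end of the class to make this obvious, but that is not required.
Then z is a simple literal.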
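The breakdown above can be sketched as a tiny left-to-right scanner. This is not how PCRE is implemented; classify_hyphens is a hypothetical helper that ignores escapes, negation and POSIX classes, and it only shows why the second hyphen has nothing left to pair with.

```python
def classify_hyphens(body: str) -> list[tuple[str, str]]:
    """Scan the text between '[' and the closing ']' from left to right.

    Hypothetical helper for illustration only. Returns (kind, text)
    pairs where kind is 'range' or 'literal'.
    """
    items = []
    i = 0
    while i < len(body):
        # A range needs a start char, a hyphen, and an end char.
        if i + 2 < len(body) and body[i + 1] == "-":
            start, end = body[i], body[i + 2]
            if ord(start) > ord(end):
                raise ValueError(f"range out of order: {start}-{end}")
            items.append(("range", f"{start}-{end}"))
            i += 3  # the end char is consumed and cannot start a new range
        else:
            items.append(("literal", body[i]))
            i += 1
    return items


print(classify_hyphens("]-a-z"))
# [('range', ']-a'), ('literal', '-'), ('literal', 'z')]
```

Because the scanner has already consumed a as the end of the first range, the hyphen that follows is the next item on its own and falls through to the literal branch.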
PCRE follows the Perl syntax, which is documented like this:
About ]:
A ] is normally either the end of a POSIX character class (see POSIX Character Classes below), or it signals the end of the bracketed character class. If you want to include a ] in the set of characters, you must generally escape it.
However, if the ] is the first (or the second if the first character is a caret) character of a bracketed character class, it does not denote the end of the class (as you cannot have an empty class) and is considered part of the set of characters that can be matched without escaping.
About hyphens:
If a hyphen in a character class cannot syntactically be part of a range, for instance because it is the first or the last character of the character class, or if it immediately follows a range, the hyphen isn't special, and so is considered a character to be matched literally. If you want a hyphen in your set of characters to be matched and its position in the class is such that it could be considered part of a range, you must escape that hyphen with a backslash.
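Note that this refers to the Perl syntax. Other flavors may behave differently. For example, [] is a valid (empty) character class in JavaScript, and it cannot match anything.
Note also that, depending on the options, PCRE can interpret it the JS way (there are a few JS compatibility flags). From the PCRE2 docs: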
An opening square bracket introduces a character class, terminated by a closing square bracket. A closing square bracket on its own is not special by default. If a closing square bracket is required as a member of the class, it should be the first data character in the class (after an initial circumflex, if present) or escaped with a backslash. This means that, by default, an empty class cannot be defined. However, if the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at the start does end the (empty) class.
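For comparison, here is how one more flavor handles a leading ]. This sketch uses Python's re module, which is neither PCRE nor JavaScript; it is included only because it is easy to run, and it sides with the default Perl/PCRE behavior of not allowing an empty class.

```python
import re

# ']' right after '[' is a literal, so this class holds ']' and 'a'.
print(re.findall(r"[]a]", "x]a"))  # [']', 'a']

# '[]' alone is not an empty class here: the parser keeps looking
# for a closing ']' and reports an error when it finds none.
try:
    re.compile(r"[]")
    empty_class_ok = True
except re.error:
    empty_class_ok = False
print(empty_class_ok)  # False
```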
Unsurprisingly, PCRE's behavior regarding hyphens matches Perl's:
The minus (hyphen) character can be used to specify a range of characters in a character class. For example, [d-m] matches any letter between d and m, inclusive. If a minus character is required in a class, it must be escaped with a backslash or appear in a position where it cannot be interpreted as indicating a range, typically as the first or last character in the class, or immediately after a range. For example, [b-d-z] matches letters in the range b to d, a hyphen character, or z.
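The positions listed in that passage can be compared side by side. The sketch below again assumes Python's re module as a stand-in flavor (it agrees with PCRE on these four spellings, but that is not guaranteed for every engine) and checks that the hyphen after a range, the escaped hyphen, the leading hyphen and the trailing hyphen all give the same class.

```python
import re

# Four ways to spell "b to d, a hyphen, or z".
candidates = [r"[b-d-z]", r"[b-d\-z]", r"[-b-dz]", r"[b-dz-]"]
sample = "a b c d e - y z"

results = {p: re.findall(p, sample) for p in candidates}
for p, found in results.items():
    print(p, found)
# each line ends with ['b', 'c', 'd', '-', 'z']
```

If portability across flavors matters, the escaped form or a hyphen at the very start or end of the class is the safer spelling, since engines disagree about hyphens elsewhere.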