Greedy operator in regular expression is not working in Tcl 8.5

Take a look at this simple regular expression code:

puts [ regexp -inline {^\-\-\S+?=\S+} "--tox=9.0" ]

The output is:

 >--tox=9

It looks like the second \S+ is non-greedy! It matches only 1 character.
In Perl, the result is what I expect; see the one-line output:

perl -e '"--tox=9.0" =~/(^\-\-\S+?=\S+)/ ; print "$1\n"'
--tox=9.0

How can I get the Perl behavior in Tcl?

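The Tcl regular expression engine is an automata-theoretic engine rather than a stack-based one, so it takes a very different approach to matching mixed-greediness REs. In particular, the kind of RE you are talking about is interpreted as entirely non-greedy.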
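To make the "entirely non-greedy" point concrete, here is a minimal sketch that compares the mixed RE from the question with a variant in which both quantifiers are explicitly non-greedy. The variable names are only illustrative.

```tcl
set arg "--tox=9.0"

# Mixed greediness: the first quantifier (\S+?) is non-greedy.
set mixed [lindex [regexp -inline {^--\S+?=\S+} $arg] 0]

# Both quantifiers explicitly non-greedy.
set allLazy [lindex [regexp -inline {^--\S+?=\S+?} $arg] 0]

puts "mixed:   $mixed"
puts "allLazy: $allLazy"
```

Both lines should show the same shortest match, --tox=9, which is the output reported in the question.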

The simplest way to fix this is to use a different RE. Remembering that \S is just shorthand for [^\s], we can do this (excluding = from the first part):

puts [ regexp -inline {^--[^\s=]+=\S+} "--tox=9.0" ]

(I also changed \- to -, since it is not a special character in Tcl's REs.)
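If the goal is to split the option into its name and value, the same negated-class RE can carry capture groups. This is only a sketch: parseOption is a hypothetical helper name, not something from the question.

```tcl
# parseOption (hypothetical helper): returns {name value} for an
# argument of the form --name=value, or an empty list otherwise.
proc parseOption {arg} {
    # All quantifiers are greedy, so the RE prefers the longest match;
    # the negated class keeps the name from running past the first "=".
    if {[regexp {^--([^\s=]+)=(\S+)} $arg -> name value]} {
        return [list $name $value]
    }
    return {}
}

puts [parseOption "--tox=9.0"]
```

For the string in the question this should print tox 9.0.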

The answer can be found here:

Unfortunately, the answer is that to get the same answer Perl gives, you have to use Perl's exact regexp implementation.

In your case, I would use both anchors, ^ and $:

  puts [ regexp -inline {^\-\-\S+?=\S+$} "--tox=9.0" ]

The result is: --tox=9.0
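Note that the trailing $ only helps when the option is the whole string. A small sketch of both cases follows; the second input string is made up for illustration.

```tcl
set re {^--\S+?=\S+$}

# The option is the entire string: the $ anchor forces the match to
# extend to the end even though the RE prefers the shortest match.
set whole [regexp -inline $re "--tox=9.0"]

# More text follows the option: \S+ cannot cross the space, so $
# can never be reached and there is no match at all.
set partial [regexp -inline $re "--tox=9.0 --bar=1"]

puts "whole:   $whole"
puts "partial: [llength $partial] match(es)"
```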

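This is an inherent 'feature' of Tcl's regular expression implementation. For example, the text below is from Henry Spencer (who, I believe, did most if not all of the work on Tcl's regular expressions):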

It is very difficult to come up with an entirely satisfactory definition of the behavior of mixed-greediness regular expressions. Perl doesn't try: the Perl "specification" is a description of the implementation, an inherently low-performance approach involving trying one match at a time. This is unsatisfactory for a number of reasons, not least being that it takes several pages of text merely to describe it. (That implementation and its description are distant, mutated descendants of one of my earlier regexp packages, so I share some of the blame for this.)

When all quantifiers are greedy, the Tcl 8.2 regexp matches the longest possible match (as specified in the POSIX standard's regular-expression definition). When all are non-greedy, it matches the shortest possible match. Neither of these desirable statements is true of Perl.

The trouble is that it is very, very hard to write a generalization of those statements which covers mixed-greediness regular expressions -- a proper, implementation-independent definition of what mixed-greediness regular expressions should match -- and makes them do "what people expect". I've tried. I'm still trying. No luck so far.

The rules in the Tcl 8.2 regexp, which basically give the whole regexp a long/short preference based on its subexpressions, are the best I've come up with so far. The code implements them accurately. I agree that they fall short of what's really wanted. It's trickier than it looks.

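Basically, expressions that mix greedy and non-greedy quantifiers complicate the implementation and hurt its performance. The implementation therefore makes the 'type' of the first quantifier apply to all the other quantifiers.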

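In other words, if the first quantifier is greedy, all the others are greedy too. If the first is non-greedy, all the others are non-greedy as well. As a result, you cannot force a Tcl regular expression to behave like a Perl one (or perhaps you can by using exec with the command-line version of perl, but I am not familiar with that).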
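The effect of the first quantifier can be seen by swapping which of the two quantifiers carries the ? in the RE from the question. This is a sketch of the rule as described above, not a full statement of Tcl's matching rules.

```tcl
set arg "--tox=9.0"

# Non-greedy quantifier first: the whole RE prefers the shortest match.
set lazyFirst [lindex [regexp -inline {^--\S+?=\S+} $arg] 0]

# Greedy quantifier first: the whole RE prefers the longest match,
# even though the last quantifier is written as non-greedy.
set greedyFirst [lindex [regexp -inline {^--\S+=\S+?} $arg] 0]

puts "lazy first:   $lazyFirst"
puts "greedy first: $greedyFirst"
```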

I suggest using negated classes and/or anchors instead of non-greedy quantifiers.

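Since I don't know the exact context of your problem, I won't offer an alternative regular expression, because that would depend on whether this is really the whole string you are trying to match.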