ANTLR 如何决定应用哪个词法分析器规则?最长匹配的词法分析器规则获胜?

How does ANTLR decide which lexer rule to apply? The longest matching lexer rule wins?

输入内容:

语法:

grammar test;

p : EOF;

Char : [a-z];

fragment Tab : '\t';
fragment Space : ' ';
T1 : (Tab|Space)+ ->skip;

T2 : '#' T1+ Char+;

匹配结果是这样的:

[@0,0:6='#   abc',<T2>,1:0]    <<<<<<<< PLACE 1
[@1,7:6='<EOF>',<EOF>,1:7]
line 1:0 extraneous input '#   abc' expecting <EOF>

请忽略最后一行的错误。我想知道为什么在 PLACE 1 匹配的令牌是 T2

在语法文件中,T2 词法分析器规则 T1 词法分析器规则之后。所以我希望 T1 规则应该首先应用。那么为什么# abc中的空格没有被跳过呢?

ANTLR 是否使用某种贪心策略来匹配当前字符流与最长的词法分析器规则?

我认为这是因为 ANTLR 在词法分析器阶段首先看到 # 字符,所以它会遵循 T2 规则。幸运的是,T2 匹配。如果 ANTLR 首先看到 </code>,它将遵循 <code>T1 规则。

三个规则适用,顺序如下:

  1. 最长的比赛先获胜。
  2. 接下来规则匹配隐式标记(如语法中的 #)。
  3. 最后,如果出现平局(按匹配长度),则匹配规则中最早列出的规则获胜。

经过半夜的搜索,我再次找到其中的大部分内容 material 来自 Sam Harwell 的一篇冗长的引述,他在其中还阐述了贪婪运算符的影响。我记得第一次看到它并在我的 TDAR 副本中草拟笔记,但没有参考。

ANTLR 4 lexers normally operate with longest-match-wins behavior, without any regard for the order in which alternatives appear in the grammar. If two lexer rules match the same longest input sequence, only then is the relative order of those rules compared to determine how the token type is assigned.

The behavior within a rule changes as soon as the lexer reaches a non-greedy optional or closure. From that moment forward to the end of the rule, all alternatives within that rule will be treated as ordered, and the path with the lowest alternative wins. This seemingly strange behavior is actually responsible for the non-greedy handling due to the way we order alternatives in the underlying ATN representation. When the lexer is in this mode and reaches the block (ESC|.), the ordering constraint requires it use ESC if possible.

“隐式令牌”规则来自

回应...

Does ANTLR uses some greedy strategy to match current character stream with the longest lexer rule?

...我将引用 ANTLR4 的 Wildcard Operator and Nongreedy Subrules 文档。

Here is how the lexer chooses token rules:

  1. The primary goal is to match the lexer rule that recognizes the most input characters.
INT : [0-9]+ ;
DOT : '.' ; // match period
FLOAT : [0-9]+ '.' ; // match FLOAT upon '34.' not INT then DOT
  1. If more than one lexer rule matches the same input sequence, the priority goes to the rule occurring first in the grammar file.
DOC : '/**' .*? '*/' ; // both rules match /** foo */, resolve to DOC
CMT : '/*' .*? '*/' ;
  1. Nongreedy subrules match the fewest number of characters that still allows the surrounding lexical rule to match.
/** Match anything except \n inside of double angle brackets */
STRING : '<<' ~'\n'*? '>>' ; // Input '<<foo>>>>' matches STRING then END
END    : '>>' ;
  1. After crossing through a nongreedy subrule within a lexical rule, all decision-making from then on is "first match wins."

For example, literal ab in rule right-hand side (grammar fragment) .*? ('a'|'ab') is dead code and can never be matched. If the input is ab, the first alternative, 'a', matches the first character and therefore succeeds. ('a'|'ab') by itself on the right-hand side of a rule properly matches the second alternative for input ab. This quirk arises from a nongreedy design decision that’s too complicated to go into here.

如果您了解规则 1、2 和 3,您可能会没事。第四条法则深奥

仅根据上面引用的信息,我没有看到关于隐式令牌规则适用于何处的明确答案。当我找到更多信息时,我将更新此答案。

我鼓励您也查看 ,其中更多地讨论了隐式令牌规则。

(另外:在我看来,如果将上面引用的内容合并到 the lexer rules docs 中,可能会更容易被发现和理解。)