ANTLR: issue with greedy rule

Question

I have never used ANTLR or written a grammar before, so this is my first attempt.

I have a custom language that I need to parse. Here is an example:

-- This is a comment
CMD.CMD1:foo_bar_123
CMD.CMD2
CMD.CMD4:9 of 28 (full)
CMD.NOTES:
This is an note.
    A line 
      (1) there could be anything here foo_bar_123 & $ £ _ , . ==> BOOM
      (3) same here
CMD.END_NOTES:

In short, there can be four types of lines:

1) -- comment
2) <section>.<command>
3) <section>.<command>: <arg>
4) <section>.<command>:
       <arg1>
       <arg2>
       ...
   <section>.<end_command>:

<section> is the literal "CMD"

<command> is a single word (uppercase, lowercase letters, numbers, '_')

<end_command> is the same word as <command>, preceded by the literal "END_" (as in CMD.END_NOTES above)

<arg> can be any sequence of characters

Here is what I have so far:

grammar MyGrammar;

/*
* Parser Rules
*/

root                : line+ EOF ;

line                : (comment_line | command_line | normal_line) NEWLINE;

comment_line        : COMMENT ;

command_line        : section '.' command ((COLON WHITESPACE*)? arg)? ;

normal_line         : TEXT ;

section             : CMD ;

command             : WORD ;

arg                 : TEXT ;

/*
* Lexer Rules
*/

fragment LOWERCASE  : [a-z] ;
fragment UPPERCASE  : [A-Z] ;
fragment DIGIT      : [0-9] ;

NUMBER          : DIGIT+ ([.,] DIGIT+)? ;

CMD             : 'CMD';

COLON           : ':' ;

COMMENT         : '--' ~[\r\n]*;

WHITESPACE      : (' ' | '\t') ;

NEWLINE         : ('\r'? '\n' | '\r')+;

WORD            : (LOWERCASE | UPPERCASE | NUMBER | '_')+ ;

TEXT            : ~[\r\n]* ;

Here is a test of my grammar:

$antlr4 MyGrammar.g4

warning(146): MyGrammar.g4:45:0: non-fragment lexer rule TEXT can match the empty string

$javac MyGrammar*.java

$grun MyGrammar root -tokens

CMD.NEW

[@0,0:6='CMD.NEW',<TEXT>,1:0]

[@1,7:7='\n',<NEWLINE>,1:7]

[@2,8:7='<EOF>',<EOF>,2:0]

The problem is that "CMD.NEW" is swallowed by TEXT, because that rule is greedy.

Can anyone help me solve this? Thanks.

Answer 1

The grammar is ambiguous.

In the example you provided, CMD.NEW could match both command_line and normal_line.
So, given the rule:

 line                : (comment_line | command_line | normal_line) NEWLINE;

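the parser never gets to choose between command_line and normal_line: the lexer runs first and always takes the longest match, so the whole line reaches the parser as a single TEXT token, and only normal_line (which is really just a TEXT) can accept it.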
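To make the lexer's choice concrete, here is a reduced version of the question's lexer rules, annotated with what each rule can match at the start of the input CMD.NEW. The grammar name GreedyDemo is made up for illustration, and TEXT uses + instead of * only to avoid the empty-match warning.

```antlr
lexer grammar GreedyDemo;   // hypothetical name, for illustration only

// Candidate matches at offset 0 of the input "CMD.NEW":
CMD     : 'CMD' ;              // can match "CMD"      (3 characters)
WORD    : [a-zA-Z0-9_]+ ;      // can match "CMD"      (3 characters)
TEXT    : ~[\r\n]+ ;           // can match "CMD.NEW"  (7 characters), so it wins
NEWLINE : '\r'? '\n' | '\r' ;

// The lexer keeps the longest match and uses rule order only to break ties,
// so the parser receives TEXT NEWLINE and never sees CMD, '.' or WORD.
```

This agrees with the -tokens output in the question, where the whole line is reported as one TEXT token.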

Consider rewriting the grammar so that the parser can always tell which rule to accept.
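One possible rewrite makes the catch-all rule available only where free text is expected. The following is an untested sketch using ANTLR 4 lexer modes. It assumes the lexer and parser are split into two grammar files, since modes are not allowed in a combined grammar. The names MyLexer, MyParser, ARG, NOTES_MODE, NOTES_BEGIN, NOTES_TEXT, NOTES_END, ARG_EOL and NOTES_EOL are made up for illustration.

```antlr
// ---- MyLexer.g4 (hypothetical file name) ----
lexer grammar MyLexer;

COMMENT     : '--' ~[\r\n]* ;
// Longer than CMD, so this wins on a "CMD.NOTES:" line and opens the notes block.
NOTES_BEGIN : 'CMD.NOTES:' -> pushMode(NOTES_MODE) ;
CMD         : 'CMD' ;
DOT         : '.' ;
WORD        : [a-zA-Z0-9_]+ ;
// After the colon (and optional blanks), the rest of the line is free text.
COLON       : ':' [ \t]* -> pushMode(ARG) ;
NEWLINE     : ('\r'? '\n' | '\r')+ ;

mode ARG;
// TEXT exists only in this mode, so it cannot swallow "CMD.NEW".
TEXT        : ~[\r\n]+ -> popMode ;
// A colon with nothing after it: emit an ordinary NEWLINE and leave the mode.
ARG_EOL     : ('\r'? '\n' | '\r')+ -> type(NEWLINE), popMode ;

mode NOTES_MODE;
// Listed before NOTES_TEXT so it wins the tie on an exact "CMD.END_NOTES:" line.
NOTES_END   : 'CMD.END_NOTES:' -> popMode ;
NOTES_TEXT  : ~[\r\n]+ ;
NOTES_EOL   : ('\r'? '\n' | '\r')+ -> type(NEWLINE) ;

// ---- MyParser.g4 (hypothetical file name) ----
parser grammar MyParser;
options { tokenVocab = MyLexer; }

root         : line+ EOF ;
line         : (comment_line | notes_block | command_line) NEWLINE ;
comment_line : COMMENT ;
command_line : CMD DOT WORD (COLON TEXT?)? ;
notes_block  : NOTES_BEGIN NEWLINE (NOTES_TEXT NEWLINE)* NOTES_END ;
```

With this split, CMD.NEW can only be tokenized as CMD, DOT, WORD, because TEXT is not a candidate in the default mode. One known limitation of the sketch: a CMD.END_NOTES: line with leading or trailing whitespace would be read as NOTES_TEXT, so real input may need a more tolerant NOTES_END rule.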

Update:

Try this (I have not tested it, but it should work):

grammar MyGrammar;

/*
* Parser Rules
*/

root                : line+ EOF ;

line                : (comment_line | command_line) NEWLINE;

comment_line        : COMMENT ;

command_line        : CMD '.' (note_cmd | command);

command             : command_name (COLON arg?)? ;

// Everything between "CMD.NOTES:" and "CMD.END_NOTES:" is consumed non-greedily.
note_cmd            : NOTES COLON .*? CMD '.' END_NOTES COLON ;

command_name        : WORD ;

// An argument is every remaining token on the line.
arg                 : ~NEWLINE+ ;

/*
* Lexer Rules
*/

fragment LOWERCASE  : [a-z] ;
fragment UPPERCASE  : [A-Z] ;
fragment DIGIT      : [0-9] ;

CMD             : 'CMD';

// NOTES and END_NOTES must come before WORD, which would otherwise win the tie.
NOTES           : 'NOTES';

END_NOTES       : 'END_NOTES';

COLON           : ':' ;

COMMENT         : '--' ~[\r\n]*;

WHITESPACE      : (' ' | '\t') ;

NEWLINE         : ('\r'? '\n' | '\r')+;

WORD            : (LOWERCASE | UPPERCASE | DIGIT | '_')+ ;

// Catch-all for any other single character. It replaces the greedy TEXT rule,
// which would still have swallowed whole lines.
ANY             : . ;
