Parsing inline and multi-line statements using antlr4's lexer modes

Question

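I am currently working on an island grammar parser that analyzes two programming languages (DSLs) in the same file. Statements in the second language always start with a special character (*), but they come in two forms: inline or multi-line.

An inline statement starts with * and ends at the newline (\r?\n).

A multi-line statement also starts with *, may span several lines, and ends with a semicolon.

I am having a hard time doing this with Antlr4's lexer modes. Can anyone point me in the right direction?

My grammar is given below. For the following example, the parser reports two errors: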

line 5:21 extraneous input '\n' expecting {<EOF>, ID, SWITCH_CHAR}
line 8:31 extraneous input '\n' expecting {<EOF>, ID, SWITCH_CHAR}

Example:

first programming language 
*example one second programming language inline statement ending with semicolon;
*example two another valid second programming language inline statement ending with newline
*example three second programming language may expand to the next line
until semicolon char;
*example four second programming language example may expand 
to a number of lines
too ending with semicolon char;
first programming language again

Lexer:

lexer grammar ComplexLanguageLexer;
/*** SEA ****/
ID: [a-z]+;
WS: [ \t\f]+ -> skip;
SWITCH_CHAR: '*' -> pushMode(inline_mode), pushMode(multiline_mode);
NEWLINE:  '\r'? '\n' -> skip;

/***ISLANDS****/
mode multiline_mode;
MULTILINE_SWITCH_CHAR: ';' -> popMode;  //seek until ';'
MULTILINE_ID: [a-z]+;
MULTILINE_WS: [ \t\f]+ -> skip;
MULTILINE_NEWLINE:  '\r'? '\n' -> skip; //just skip newlines in the multiline mode

mode inline_mode;
INLINE_NEWLINE:  '\r'? '\n' -> type(NEWLINE), popMode;
INLINE_SEMICOLONCHAR: ';' ; //just match semicolonchar
INLINE_ID: [a-z]+;
INLINE_WS: [ \t\f]+ -> skip;

Parser:

parser grammar ComplexLanguageParser;
options { tokenVocab = ComplexLanguageLexer ; }

startRule:   programStatement+;

programStatement:
    word | inlineStatement| multilineStatement
;

word: ID;

inlineStatement:
    SWITCH_CHAR INLINE_ID+ INLINE_SEMICOLONCHAR? NEWLINE
;

multilineStatement:
    SWITCH_CHAR MULTILINE_ID+ MULTILINE_SWITCH_CHAR
;

Update

I have updated the lexer/parser grammars following @GRosenberg's instructions:

Lexer

lexer grammar ComplexLanguageLexer;

SWITCH_CHAR: STAR -> pushMode(second_mode) ;
ID1         : ID ;
WS1         : WS -> skip ;
NL1         : NL -> skip ;

fragment STAR : '*' ;
fragment ID   : [a-z]+ ;
fragment WS   : [ \t\f]+ ;
fragment NL   : '\r'? '\n' ;

mode second_mode ;
    TERM1 : ( WS | NL )* SEMI -> popMode ;
    TERM2 : WS* NL -> popMode ;
    ID2   : ID ;
    WS2   : WS+ -> skip ;
    NL2   : NL;
    SEMI : ';';

Parser

parser grammar ComplexLanguageParser;
options { tokenVocab = ComplexLanguageLexer ; }

startRule:  programStatement+;
programStatement:   firstLanguageStatement | secondLanguageStatment ;
firstLanguageStatement:    word ;
secondLanguageStatment:    SWITCH_CHAR (inlineStatement| multilineStatement)     ;
word: ID1;
multilineStatement:    (ID2|NL2)+ TERM1;
inlineStatement:   ID2+ TERM2;

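It works as expected for inline statements, but it still fails for multi-line statements. I am not sure what I am doing wrong here.

For example: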

first language            -> ok
*second language inline   -> ok 
*multi line;              -> ok
*multi line expands to 
 next line;                ->  token recognition error at ';'
*multi line
;                          -> ok
first language again       -> ok

Answer 1

The pushMode and popMode commands are implemented using a single stack. So the rule

SWITCH_CHAR: '*' -> pushMode(inline_mode), pushMode(multiline_mode);

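should cause the lexer to evaluate the multiline_mode rules first. On the pop, the lexer will then evaluate the inline_mode rules. That is probably not what you want.

It is better to implement a single lexer mode that correctly handles all second-language statements. The basic idea is: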
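
To make the stack behaviour concrete, here is a small Python model of the mode stack. It is only a sketch of how I understand pushMode and popMode to behave (remember the current mode, switch to the new one, and restore on pop); the class and method names are made up for illustration and are not part of the ANTLR runtime.

```python
class ModeStack:
    """Toy model of a lexer mode stack (hypothetical, not the ANTLR runtime)."""

    def __init__(self):
        self.mode = "DEFAULT_MODE"
        self._stack = []

    def push_mode(self, new_mode):
        # Remember the current mode, then switch to the new one.
        self._stack.append(self.mode)
        self.mode = new_mode

    def pop_mode(self):
        # Return to the most recently remembered mode.
        self.mode = self._stack.pop()
        return self.mode


lexer = ModeStack()

# SWITCH_CHAR: '*' -> pushMode(inline_mode), pushMode(multiline_mode);
lexer.push_mode("inline_mode")
lexer.push_mode("multiline_mode")
after_star = lexer.mode

# MULTILINE_SWITCH_CHAR: ';' -> popMode;
after_semicolon = lexer.pop_mode()

# INLINE_NEWLINE: '\r'? '\n' -> type(NEWLINE), popMode;
after_newline = lexer.pop_mode()
```

So after the '*' the lexer is always in multiline_mode, and inline_mode is only reached once a ';' has been seen. A statement that ends with a newline and no semicolon never gets the inline rules applied to it.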

SWITCH_CHAR : STAR -> pushMode(second_mode) ;

mode second_mode ;
    STMT1 : ( ID | WS | NL )+ SEMI -> popMode ;
    STMT2 : ( ID | WS )+ NL -> popMode ;

Untested, but it should work as long as ID does not include STAR or SEMI.

Update

To expose the IDs to the parser, just separate them out of the statement rules:

SWITCH_CHAR: STAR -> pushMode(second_mode) ;
ID1         : ID ;
WS1         : WS -> skip ;
NL1         : NL -> skip ;

fragment STAR : '*' ;
fragment ID   : [a-z]+ ;
fragment WS   : [ \t\f]+ ;
fragment NL   : '\r'? '\n' ;

mode second_mode ;
    TERM1 : ( WS | NL )* SEMI -> popMode ;
    TERM2 : WS+ NL -> popMode ;
    ID2   : ID ;
    WS2   : WS+ -> skip ;

However, this creates an ambiguity:

 *example two inline statement ending with newline
 first programming language again (including a semicolon)

If that is valid input, then there is not enough structure to disambiguate without resorting to native code.
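
To see where this bites in practice, here is a rough Python simulation of the rules above. It assumes the usual lexer policy of "longest match wins, ties go to the rule listed first", and it reports an unmatched character as an error and moves on. The tokenize function and the RULES table are hypothetical stand-ins, not generated ANTLR code.

```python
import re

WS = r"[ \t\f]"
NL = r"\r?\n"
ID = r"[a-z]+"

# (token name, pattern, action) per mode; action is "skip", "push", "pop" or None.
RULES = {
    "DEFAULT_MODE": [
        ("SWITCH_CHAR", r"\*", "push"),
        ("ID1", ID, None),
        ("WS1", WS + "+", "skip"),
        ("NL1", NL, "skip"),
    ],
    "second_mode": [
        # ( WS | NL )* SEMI, written per character to keep the regex cheap.
        ("TERM1", rf"(?:{WS}|{NL})*;", "pop"),
        ("TERM2", rf"{WS}+{NL}", "pop"),
        ("ID2", ID, None),
        ("WS2", WS + "+", "skip"),
    ],
}


def tokenize(text):
    """Return token names; the longest match wins, ties go to the earlier rule."""
    mode, stack, pos, out = "DEFAULT_MODE", [], 0, []
    while pos < len(text):
        best = None
        for name, pattern, action in RULES[mode]:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[0]):
                best = (m.end(), name, action)
        if best is None:
            # No rule in the current mode accepts this character.
            out.append(f"ERROR({text[pos]!r})")
            pos += 1
            continue
        end, name, action = best
        if action != "skip":
            out.append(name)
        if action == "push":
            stack.append(mode)
            mode = "second_mode"
        elif action == "pop":
            mode = stack.pop()
        pos = end
    return out


broken = tokenize("*multi line expands to \n next line;")
fine = tokenize("*multi line\n;")
```

In the first input, the text after "to" is a space followed by a newline, so TERM2 matches and pops the mode before the lexer has any way of knowing that a ';' follows on the next line. The rest of the statement is then lexed as first-language IDs and the ';' has no rule in the default mode. In the second input, TERM1 can consume the newline and the ';' together, so it wins.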

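Before going there, a possibly better design choice is to defer any distinction between the first and second languages to the parser or, better still, to an analysis of the generated parse tree.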
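
As a sketch of what deferring the decision could look like, the Python below lexes everything with one flat, mode-free token set and only afterwards groups tokens into statements. It relies on one loud assumption that is not in the question: first-language text never contains ';'. Under that assumption, a ';' that appears before the next '*' must belong to the current second-language statement. The names flat_tokens and group_statements are hypothetical.

```python
import re

TOKEN = re.compile(
    r"(?P<STAR>\*)|(?P<SEMI>;)|(?P<NL>\r?\n)|(?P<ID>[a-z]+)|(?P<WS>[ \t\f]+)"
)


def flat_tokens(text):
    """Mode-free lexing: every token kind is recognised everywhere."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "WS":  # whitespace is dropped, newlines are kept
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def group_statements(tokens):
    """Group tokens into statements.

    ASSUMPTION (not from the question): ';' never occurs in first-language text.
    """
    statements, i = [], 0
    while i < len(tokens):
        kind, text = tokens[i]
        if kind == "ID":
            statements.append(("first", [text]))
            i += 1
        elif kind == "STAR":
            # Scan ahead to whichever comes first: a ';' or the next '*'.
            j = i + 1
            while j < len(tokens) and tokens[j][0] not in ("SEMI", "STAR"):
                j += 1
            if j < len(tokens) and tokens[j][0] == "SEMI":
                label, end = "semi", j
            else:
                # No ';' belongs to this statement, so it ends at the first newline.
                end = i + 1
                while end < len(tokens) and tokens[end][0] != "NL":
                    end += 1
                label = "newline"
            words = [t for k, t in tokens[i + 1:end] if k == "ID"]
            statements.append((label, words))
            i = end + 1
        else:
            i += 1  # newline (or stray ';') between statements
    return statements


result = group_statements(flat_tokens("first\n*a b;\n*c d\n*e\nf;\nlast\n"))
```

Here result labels "*a b;" and the two-line "*e / f;" as semicolon-terminated, "*c d" as newline-terminated, and the remaining words as first-language. If the first language may contain ';', the assumption breaks and a newline-terminated statement followed by such text is grouped as one multi-line statement, which is exactly the ambiguity described above.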


antlr

antlr4