Removing ambiguity in antlr4 lexing

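I have been trying to convert a grammar I found online into ANTLR4 format. The original grammar is here: https://github.com/cv/asp-parser/blob/master/vbscript.bnf.

Short version: I believe the problem I am currently running into is caused by ambiguity in the lexing phase.

For example, I have copied the rule for float literals below: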

float_literal   : DIGIT* '.' DIGIT+ ( 'e' PLUS_OR_MINUS? DIGIT+ )?
               | DIGIT+ 'e' PLUS_OR_MINUS? DIGIT+;

Further up in the file, I have a definition for letters:

LETTER: 'a'..'z';

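It seems that because I use 'e' in the float literal, that character can no longer be recognized as a letter? In my research I have seen grammars that define a token for each letter, so that letter becomes: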

letter: A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z;

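I would then replace every instance of 'e' with E. But this file contains longer strings, such as '.and'. Would this approach require replacing those with something like DOT A N D? That does not seem right at all.

Am I doing something wrong, or is there something I can do to avoid this ambiguity?

Thanks, Craig

The full grammar follows.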

grammar vbscript;
/*===== Character Sets =====*/

SPACES: ' ' -> skip;

DIGIT: '0'..'9';
SEMI_COLON: ':';
NEW_LINE_CHARACTER: [\r\n]+;
WHITESPACE_CHARACTER: [ \t];
LETTER: 'a'..'z';
QUOTE: '"';
HASH: '#';
SQUARE_BRACE: '[' | ']';
PLUS_OR_MINUS: [+-];
ANYTHING_ELSE: ~('"' | '#');
ws: WHITESPACE_CHARACTER;
id_tail: (DIGIT | LETTER | '_');
string_character: ANYTHING_ELSE | DIGIT | WHITESPACE_CHARACTER | SEMI_COLON | LETTER | PLUS_OR_MINUS | SQUARE_BRACE;
id_name_char: ANYTHING_ELSE | DIGIT | WHITESPACE_CHARACTER | SEMI_COLON | LETTER | PLUS_OR_MINUS;


/*===== terminals =====*/
whitespace: ws+ | '_' ws* new_line?;

comment_line   : '' | 'rem';

string_literal  : '"' ( string_character | '""' )* '"';

float_literal   : DIGIT* '.' DIGIT+ ( 'e' PLUS_OR_MINUS? DIGIT+ )?
               | DIGIT+ 'e' PLUS_OR_MINUS? DIGIT+;
id             : LETTER id_tail*
               | '[' id_name_char* ']';
iddot          : LETTER id_tail* '.'
               | '[' id_name_char* ']' '.'
               | 'and.'
               | 'byref.'
               | 'byval.'
               | 'call.'
               | 'case.'
               | 'class.'
               | 'const.'
               | 'default.'
               | 'dim.'
               | 'do.'
               | 'each.'
               | 'else.'
               | 'elseif.'
               | 'empty.'
               | 'end.'
               | 'eqv.'
               | 'erase.'
               | 'error.'
               | 'exit.'
               | 'explicit.'
               | 'false.'
               | 'for.'
               | 'function.'
               | 'get.'
               | 'goto.'
               | 'if.'
               | 'imp.'
               | 'in.'
               | 'is.'
               | 'let.'
               | 'loop.'
               | 'mod.'
               | 'new.'
               | 'next.'
               | 'not.'
               | 'nothing.'
               | 'null.'
               | 'on.'
               | 'option.'
               | 'or.'
               | 'preserve.'
               | 'private.'
               | 'property.'
               | 'public.'
               | 'redim.'
               | 'rem.'
               | 'resume.'
               | 'select.'
               | 'set.'
               | 'step.'
               | 'sub.'
               | 'then.'
               | 'to.'
               | 'true.'
               | 'until.'
               | 'wend.'
               | 'while.'
               | 'with.'
               | 'xor.';

dot_id          : '.' LETTER id_tail*
               | '.' '[' id_name_char* ']'
               | '.and'
               | '.byref'
               | '.byval'
               | '.call'
               | '.case'
               | '.class'
               | '.const'
               | '.default'
               | '.dim'
               | '.do'
               | '.each'
               | '.else'
               | '.elseif'
               | '.empty'
               | '.end'
               | '.eqv'
               | '.erase'
               | '.error'
               | '.exit'
               | '.explicit'
               | '.false'
               | '.for'
               | '.function'
               | '.get'
               | '.goto'
               | '.if'
               | '.imp'
               | '.in'
               | '.is'
               | '.let'
               | '.loop'
               | '.mod'
               | '.new'
               | '.next'
               | '.not'
               | '.nothing'
               | '.null'
               | '.on'
               | '.option'
               | '.or'
               | '.preserve'
               | '.private'
               | '.property'
               | '.public'
               | '.redim'
               | '.rem' 
               | '.resume'
               | '.select'
               | '.set'
               | '.step'
               | '.sub'
               | '.then'
               | '.to'
               | '.true'
               | '.until'
               | '.wend'
               | '.while'
               | '.with'
               | '.xor';

dot_iddot      : '.' LETTER id_tail* '.'
               | '.' '[' id_name_char* ']' '.'
               | '.and.'
               | '.byref.'
               | '.byval.'
               | '.call.'
               | '.case.'
               | '.class.'
               | '.const.'
               | '.default.'
               | '.dim.'
               | '.do.'
               | '.each.'
               | '.else.'
               | '.elseif.'
               | '.empty.'
               | '.end.'
               | '.eqv.'
               | '.erase.'
               | '.error.'
               | '.exit.'
               | '.explicit.'
               | '.false.'
               | '.for.'
               | '.function.'
               | '.get.'
               | '.goto.'
               | '.if.'
               | '.imp.'
               | '.in.'
               | '.is.'
               | '.let.'
               | '.loop.'
               | '.mod.'
               | '.new.'
               | '.next.'
               | '.not.'
               | '.nothing.'
               | '.null.'
               | '.on.'
               | '.option.'
               | '.or.'
               | '.preserve.'
               | '.private.'
               | '.property.'
               | '.public.'
               | '.redim.'
               | '.rem.'
               | '.resume.'
               | '.select.'
               | '.set.'
               | '.step.'
               | '.sub.'
               | '.then.'
               | '.to.'
               | '.true.'
               | '.until.'
               | '.wend.'
               | '.while.'
               | '.with.'
               | '.xor.';

/*===== rules =====*/
new_line: (SEMI_COLON | NEW_LINE_CHARACTER)+;
program: new_line? global_stmt_list;

/*===== rules: declarations =====*/
class_decl: 'class' extended_id new_line member_decl_list 'end' 'class' new_line;
member_decl_list: member_decl*;
member_decl: field_decl | var_decl | const_decl | sub_decl | function_decl | property_decl;
field_decl: 
  'private' field_name other_vars_opt new_line 
| 'public'  field_name other_vars_opt new_line;
field_name: field_id '(' array_rank_list ')' | field_id;
field_id: id | 'default' | 'erase' | 'error' | 'explicit' | 'step';
var_decl: 'dim' var_name other_vars_opt new_line;
var_name: extended_id '(' array_rank_list ')' | extended_id;
other_vars_opt: (',' var_name other_vars_opt)?;
array_rank_list: (int_literal ',' array_rank_list | int_literal)?;
const_decl: access_modifier_opt 'const' const_list new_line;
const_list: extended_id '=' const_expr_def ',' const_list | extended_id  '=' const_expr_def;
const_expr_def: '(' const_expr_def ')' 
| '-' const_expr_def
| '+' const_expr_def
| const_expr;
sub_decl: 
  method_access_opt 'sub' extended_id method_arg_list new_line method_stmt_list 'end' 'sub' new_line
| method_access_opt 'sub' extended_id method_arg_list inline_stmt 'end' 'sub' new_line;
function_decl: 
  method_access_opt 'function' extended_id method_arg_list new_line method_stmt_list 'end' 'function' new_line
| method_access_opt 'function' extended_id method_arg_list inline_stmt 'end' 'function' new_line;
method_access_opt: 'public' 'default' | access_modifier_opt;
access_modifier_opt: ('public' | 'private')?;
method_arg_list: ('(' arg_list? ')')?;
arg_list: arg (',' arg_list)?;
arg: arg_modifier_opt extended_id ('(' ')')?;
arg_modifier_opt: ('byval' | 'byref')?;
property_decl: method_access_opt 'property' property_access_type extended_id method_arg_list new_line method_stmt_list 'end' 'property' new_line;
property_access_type: 'get' | 'let' | 'set';

/*===== rules: statements =====*/
global_stmt: option_explicit | class_decl | field_decl | const_decl | sub_decl | function_decl | block_stmt;
method_stmt: const_decl | block_stmt;
block_stmt: 
  var_decl 
| redim_stmt 
| if_stmt 
| with_stmt 
| select_stmt 
| loop_stmt 
| for_stmt 
| inline_stmt new_line;
inline_stmt: 
  assign_stmt 
| call_stmt 
| sub_call_stmt 
| error_stmt 
| exit_stmt 
| 'erase' extended_id;
global_stmt_list: global_stmt_list global_stmt | global_stmt;
method_stmt_list: method_stmt*;
block_stmt_list: block_stmt*;
option_explicit: 'option' 'explicit' new_line;
error_stmt: 'on' 'error' 'resume' 'next' | 'on' 'error' 'goto' int_literal;
exit_stmt: 'exit' 'do' | 'exit' 'for' | 'exit' 'function' | 'exit' 'property' | 'exit' 'sub';
assign_stmt: 
        left_expr '=' expr 
| 'set' left_expr '=' expr 
| 'set' left_expr '=' 'new' left_expr;
sub_call_stmt:             qualified_id sub_safe_expr? comma_expr_list
                         | qualified_id sub_safe_expr?
                         | qualified_id '(' expr ')' comma_expr_list
                         | qualified_id '(' expr ')'
                         | qualified_id '(' ')'
                         | qualified_id index_or_params_list '.' left_expr_tail sub_safe_expr? comma_expr_list
                         | qualified_id index_or_params_list_dot left_expr_tail sub_safe_expr? comma_expr_list
                         | qualified_id index_or_params_list '.' left_expr_tail sub_safe_expr?
                         | qualified_id index_or_params_list_dot left_expr_tail sub_safe_expr?;


call_stmt: 'call' left_expr;

left_expr: qualified_id index_or_params_list '.' left_expr_tail
                         | qualified_id index_or_params_list_dot left_expr_tail
                         | qualified_id index_or_params_list
                         | qualified_id
                         | safe_keyword_id;

left_expr_tail: qualified_id_tail index_or_params_list '.' left_expr_tail
                         | qualified_id_tail index_or_params_list_dot left_expr_tail
                         | qualified_id_tail index_or_params_list
                         | qualified_id_tail;

qualified_id: iddot qualified_id_tail
                         | dot_iddot qualified_id_tail
                         | id
                         | dot_id;

qualified_id_tail: iddot qualified_id_tail
                         | id
                         | keyword_id;

keyword_id: safe_keyword_id
                         | 'and'
                         | 'byref'
                         | 'byval'
                         | 'call'
                         | 'case'
                         | 'class'
                         | 'const'
                         | 'dim'
                         | 'do'
                         | 'each'
                         | 'else'
                         | 'elseif'
                         | 'empty'
                         | 'end'
                         | 'eqv'
                         | 'exit'
                         | 'false'
                         | 'for'
                         | 'function'
                         | 'get'
                         | 'goto'
                         | 'if'
                         | 'imp'
                         | 'in'
                         | 'is'
                         | 'let'
                         | 'loop'
                         | 'mod'
                         | 'new'
                         | 'next'
                         | 'not'
                         | 'nothing'
                         | 'null'
                         | 'on'
                         | 'option'
                         | 'or'
                         | 'preserve'
                         | 'private'
                         | 'public'
                         | 'redim'
                         | 'resume'
                         | 'select'
                         | 'set'
                         | 'sub'
                         | 'then'
                         | 'to'
                         | 'true'
                         | 'until'
                         | 'wend'
                         | 'while'
                         | 'with'
                         | 'xor';

safe_keyword_id: 'default'
                         | 'erase'
                         | 'error'
                         | 'explicit'
                         | 'property'
                         | 'step';

extended_id: safe_keyword_id
                         | id;

index_or_params_list: index_or_params index_or_params_list
                         | index_or_params;

index_or_params: '(' expr comma_expr_list ')'
                         | '(' comma_expr_list ')'
                         | '(' expr ')'
                         | '(' ')';

index_or_params_list_dot: index_or_params index_or_params_list_dot
                         | index_or_params_dot;

index_or_params_dot: '(' expr comma_expr_list ').'
                         | '(' comma_expr_list ').'
                         | '(' expr ').'
                         | '(' ').';

comma_expr_list: ',' expr comma_expr_list
                         | ',' comma_expr_list
                         | ',' expr
                         | ',';

/* redim statement */

redim_stmt: 'redim' redim_decl_list new_line
                         | 'redim' 'preserve' redim_decl_list new_line;

redim_decl_list: redim_decl ',' redim_decl_list
                         | redim_decl;

redim_decl: extended_id '(' expr_list ')';

/* if statement */

if_stmt: 'if' expr 'then' new_line block_stmt_list else_stmt_list 'end' 'if' new_line
                         | 'if' expr 'then' inline_stmt else_opt end_if_opt new_line;

else_stmt_list: ('elseif' expr 'then' new_line block_stmt_list else_stmt_list
                         | 'elseif' expr 'then' inline_stmt new_line else_stmt_list
                         | 'else' inline_stmt new_line
                         | 'else' new_line block_stmt_list)?;

else_opt: ('else' inline_stmt)?;
end_if_opt : ('end' 'if')?;

/* with statement */

with_stmt: 'with' expr new_line block_stmt_list 'end' 'with' new_line;

/* loop statement */

loop_stmt: 'do' loop_type expr new_line block_stmt_list 'loop' new_line
                         | 'do' new_line block_stmt_list 'loop' loop_type expr new_line
                         | 'do' new_line block_stmt_list 'loop' new_line
                         | 'while' expr new_line block_stmt_list 'wend' new_line;

loop_type: 'while' | 'until';

/* for statement */

for_stmt: 'for' extended_id '=' expr 'to' expr step_opt new_line block_stmt_list 'next' new_line
                         | 'for' 'each' extended_id 'in' expr new_line block_stmt_list 'next' new_line;

step_opt: ('step' expr)?;

/* select statement */

select_stmt: 'select' 'case' expr new_line cast_stmt_list 'end' 'select' new_line;

cast_stmt_list: ('case' expr_list nl_opt block_stmt_list cast_stmt_list
                         | 'case' 'else' nl_opt block_stmt_list)?;

nl_opt: new_line?;

expr_list: expr ',' expr_list | expr;

/*===== rules: expressions =====*/

sub_safe_expr: sub_safe_imp_expr;

sub_safe_imp_expr: sub_safe_imp_expr 'imp' eqv_expr | sub_safe_eqv_expr;

sub_safe_eqv_expr: sub_safe_eqv_expr 'eqv' xor_expr
                         | sub_safe_xor_expr;

sub_safe_xor_expr: sub_safe_xor_expr 'xor' or_expr
                         | sub_safe_or_expr;

sub_safe_or_expr: sub_safe_or_expr 'or' and_expr
                         | sub_safe_and_expr;

sub_safe_and_expr       : sub_safe_and_expr 'and' not_expr
                         | sub_safe_not_expr;

sub_safe_not_expr       : 'not' not_expr
                         | sub_safe_compare_expr;



sub_safe_compare_expr   : sub_safe_compare_expr 'is' concat_expr
                         | sub_safe_compare_expr 'is' 'not' concat_expr
                         | sub_safe_compare_expr '>=' concat_expr
                         | sub_safe_compare_expr '=>' concat_expr
                         | sub_safe_compare_expr '<=' concat_expr
                         | sub_safe_compare_expr '=<' concat_expr
                         | sub_safe_compare_expr '>'  concat_expr
                         | sub_safe_compare_expr '<'  concat_expr
                         | sub_safe_compare_expr '<>' concat_expr
                         | sub_safe_compare_expr '='  concat_expr
                         | sub_safe_concat_expr;

sub_safe_concat_expr    : sub_safe_concat_expr '&' add_expr
                         | sub_safe_add_expr;

sub_safe_add_expr       : sub_safe_add_expr '+' mod_expr
                         | sub_safe_add_expr '-' mod_expr
                         | sub_safe_mod_expr;

sub_safe_mod_expr       : sub_safe_mod_expr 'mod' int_div_expr
                         | sub_safe_int_div_expr;

sub_safe_int_div_expr    : sub_safe_int_div_expr '\' mult_expr
                         | sub_safe_mult_expr;

sub_safe_mult_expr      : sub_safe_mult_expr '*' unary_expr
                         | sub_safe_mult_expr '/' unary_expr
                         | sub_safe_unary_expr;

sub_safe_unary_expr     : '-' unary_expr
                         | '+' unary_expr
                         | sub_safe_exp_expr;

sub_safe_exp_expr       : sub_safe_value '^' exp_expr
                         | sub_safe_value;

sub_safe_value         : const_expr
                         | left_expr
                         | '(' expr ')';

expr                 : imp_expr;

imp_expr              : imp_expr 'imp' eqv_expr
                         | eqv_expr;

eqv_expr              : eqv_expr 'eqv' xor_expr
                         | xor_expr;

xor_expr              : xor_expr 'xor' or_expr
                         | or_expr;

or_expr               : or_expr 'or' and_expr
                         | and_expr;

and_expr              : and_expr 'and' not_expr
                         | not_expr;

not_expr              : 'not' not_expr
                         | compare_expr;

compare_expr          : compare_expr 'is' concat_expr
                         | compare_expr 'is' 'not' concat_expr
                         | compare_expr '>=' concat_expr
                         | compare_expr '=>' concat_expr
                         | compare_expr '<=' concat_expr
                         | compare_expr '=<' concat_expr
                         | compare_expr '>'  concat_expr
                         | compare_expr '<'  concat_expr
                         | compare_expr '<>' concat_expr
                         | compare_expr '='  concat_expr
                         | concat_expr;

concat_expr           : concat_expr '&' add_expr
                         | add_expr;

add_expr              : add_expr '+' mod_expr
                         | add_expr '-' mod_expr
                         | mod_expr;

mod_expr              : mod_expr 'mod' int_div_expr
                         | int_div_expr;

int_div_expr           : int_div_expr '\' mult_expr
                         | mult_expr;

mult_expr             : mult_expr '*' unary_expr
                         | mult_expr '/' unary_expr
                         | unary_expr;

unary_expr            : '-' unary_expr
                         | '+' unary_expr
                         | exp_expr;

exp_expr              : value '^' exp_expr
                         | value;

value                : const_expr
                         | left_expr
                         | '(' expr ')';

const_expr            : bool_literal
                         | int_literal
                         | float_literal
                         | string_literal
                         | nothing;

bool_literal          : 'true'
                         | 'false';

int_literal           : DIGIT+;

nothing              : 'nothing'
                         | 'null'
                         | 'empty';

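Your grammar defines its literals in the parser section. Note that ANTLR treats every rule whose name starts with a lowercase letter as a parser rule (rules whose names start with an uppercase letter are lexer rules).

The small part of your problem can be solved like this: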

FLOAT_LITERAL
  : DIGIT* '.' DIGIT+ ( 'e' PLUS_OR_MINUS? DIGIT+ )?
  | DIGIT+ 'e' PLUS_OR_MINUS? DIGIT+;
LETTER
  : [a-z];

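The ANTLR lexer prefers the rule that produces the longest match (if two rules match the same length of input, it prefers the one defined first). These two rules are completely disjoint, so the order of definition does not matter (defining the more complex rules above the basic ones is simply more readable).

You can extend the second definition to cover uppercase characters as well: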
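
As a rough illustration of those two tie-breaking rules, here is a minimal lexer sketch. The grammar name MatchDemo and the rule names IF, ID, FLOAT, and WS are made up for this example and do not come from the grammar in the question.

```antlr
lexer grammar MatchDemo;   // hypothetical name; the file would be MatchDemo.g4

// Illustrative rules only, not taken from the original grammar.
IF    : 'if' ;                          // listed before ID, so the exact input "if" becomes IF
ID    : [a-z]+ ;                        // "iffy" is longer than "if", so longest match yields ID
FLOAT : [0-9]+ 'e' [+-]? [0-9]+ ;       // "1e5" is consumed as one FLOAT token
WS    : [ \t\r\n]+ -> skip ;            // whitespace is discarded
```

With these rules, `iffy` should come out as a single ID (longest match), `if` as IF (same length, first definition wins), `1e5` as one FLOAT, and a lone `e` as ID. As far as I know, in a combined grammar ANTLR also turns quoted literals used inside parser rules into implicit tokens that take priority over the explicit lexer rules, which is probably why the bare 'e' in the parser rule float_literal keeps LETTER from matching that character.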

LETTER
  : [a-zA-Z];

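To solve the overall problem, you will need to rewrite the grammar completely. Most of the rules in the terminals section should be lexer rules. However, the terminals section looks overcrowded, so some of those rules may also be workarounds for parser rules that do not exist.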
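
To give an idea of the direction, here is a minimal sketch of what moving the literals into the lexer could look like. It is not a full rewrite: the grammar name VbLiteralsSketch, the parser rule literal, and the fragment EXPONENT are assumptions made for this example, and the string rule is simplified compared to string_character in the question.

```antlr
grammar VbLiteralsSketch;   // hypothetical name; the file would be VbLiteralsSketch.g4

// Parser rule (lowercase): works on whole tokens, never on single characters.
literal : FLOAT_LITERAL | INT_LITERAL | STRING_LITERAL ;

// Lexer rules (uppercase): each literal is matched as one token.
FLOAT_LITERAL
  : DIGIT* '.' DIGIT+ EXPONENT?
  | DIGIT+ EXPONENT
  ;
INT_LITERAL    : DIGIT+ ;
STRING_LITERAL : '"' ( ~["\r\n] | '""' )* '"' ;   // simplified: "" is an escaped quote
ID             : LETTER ( LETTER | DIGIT | '_' )* ;
WS             : [ \t]+ -> skip ;

// Fragments only help build other lexer rules and never become tokens themselves,
// so using 'e' inside EXPONENT does not take the letter away from ID.
fragment EXPONENT : [eE] [+-]? DIGIT+ ;
fragment DIGIT    : [0-9] ;
fragment LETTER   : [a-zA-Z] ;
```

The point of the fragment rules is that DIGIT and LETTER stop being tokens in their own right, so the parser sees an ID or a FLOAT_LITERAL rather than a stream of individual characters. Keywords such as 'and' could then stay as quoted literals in parser rules, with the dot handled as a separate token rather than baked into strings like '.and'.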