UnexpectedCharacters Error when parsing for roman numeral using lark parser (ebnf grammar)
I am using the following grammar in lark-parser to parse letters and Roman numerals. The grammar is as follows:
DIGIT: "0".."9"
INT: DIGIT+
_L_PAREN: "("
_R_PAREN: ")"
LCASE_LETTER: "a".."z"
ROMAN_NUMERALS: "viii" | "vii" | "iii" | "ii" | "ix" | "vi" | "iv" | "v" | "i" | "x"
?start: qns_num qns_alphabet qns_part
qns_num: INT?
qns_alphabet: _L_PAREN LCASE_LETTER _R_PAREN | LCASE_LETTER _R_PAREN | LCASE_LETTER?
qns_part: _L_PAREN ROMAN_NUMERALS _R_PAREN | ROMAN_NUMERALS _R_PAREN | ROMAN_NUMERALS?
When I use this grammar to parse the following text, I get an exception:
# lark.exceptions.UnexpectedCharacters: No terminal defined for 'i' at line 1 col 5
# 10i)i)
# ^
result = Lark(grammar, parser='lalr').parse("10i)i)")
I can't for the life of me figure out why the exception is thrown. But this works fine:
result = Lark(grammar, parser='lalr').parse("10(i)(i)") # no error
This happens because both rules can be empty, which causes the lexer to
always skip over one of them in order to match the terminal with the
higher priority.
With one rule empty and the other one matched, the parser expects
EOF, not more input. Introducing ( forces the rule to be non-empty.
So, changing the priority on LCASE_LETTER won't help. But not allowing
the first rule to be empty will.
The Earley algorithm will know how to resolve this ambiguity
automatically.
I asked the same question on the lark-parser GitHub page. The answer above is from there.