How to prevent lark from recognizing parts of an identifier as a keyword?

Question

I have been experimenting with lark and ran into a small problem. Suppose I have the following grammar.

from lark import Lark

# Raw string so the regex escape \d reaches Lark unchanged.
parser = Lark(r'''
    ?start: value
            | start "or" value -> or
    ?value: DIGIT -> digit
            | ID -> id

    DIGIT: /[1-9]\d*/

    %import common.CNAME -> ID

    %import common.WS
    %ignore WS
    ''', parser='lalr')

Suppose I want to parse 1orfoo:

print(parser.parse("1orfoo").pretty())

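I expect lark to read this as the digit 1 followed by the identifier orfoo, and therefore to raise an error, because the grammar does not accept that kind of expression.

However, the parser runs without error and prints: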

or
  digit 1
  id    foo

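As you can see, lark splits the identifier and treats the expression as an or statement.

Why does this happen? Am I missing something? How can I prevent this behavior?

Thanks in advance.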

Answer 1

Lark can use different lexers to structure the input text into tokens. The default is "auto", which picks a lexer based on the parser. For LALR, the "contextual" lexer is chosen (reference). The contextual lexer uses the LALR look-ahead to discard token choices that do not fit the grammar (reference):

The contextual lexer communicates with the parser, and uses the parser's lookahead prediction to narrow its choice of tokens. So at each point, the lexer only matches the subgroup of terminals that are legal at that parser state, instead of all of the terminals. It’s surprisingly effective at resolving common terminal collisions, and allows to parse languages that LALR(1) was previously incapable of parsing.

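In your code, the contextual lexer is used because you chose the lalr parser. The lexer first creates a DIGIT token for 1. Next, it has to decide whether to create a token for the "or" literal or an ID token. Because the parser state at that point does not expect an ID token, the lexer eliminates that choice and tokenizes or. The remaining foo is then matched as an ID, so the parse succeeds.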
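The effect described above can be reproduced with a small pure-Python sketch. This is not Lark's implementation: the `LEGAL_NEXT` table is a hypothetical stand-in for the LALR parser states (it keys on the previous token only), and `contextual_tokens` is an illustrative name. It only shows how restricting the lexer to the terminals that are legal at each point makes `orfoo` split into `or` and `foo`.

```python
import re

# Toy terminals mirroring the grammar in the question.
TERMINALS = {
    "DIGIT": re.compile(r"[1-9]\d*"),
    "OR": re.compile(r"or"),
    "ID": re.compile(r"[A-Za-z_][A-Za-z_0-9]*"),
}

# Hypothetical stand-in for the parser's lookahead: which terminals
# may follow the previous token. None means "start of input".
LEGAL_NEXT = {
    None: {"DIGIT", "ID"},
    "DIGIT": {"OR"},
    "ID": {"OR"},
    "OR": {"DIGIT", "ID"},
}


def contextual_tokens(text):
    """Tokenize text, trying only the terminals legal after the previous token."""
    tokens, pos, prev = [], 0, None
    while pos < len(text):
        if text[pos].isspace():  # mimic %ignore WS
            pos += 1
            continue
        candidates = []
        for name in LEGAL_NEXT[prev]:
            match = TERMINALS[name].match(text, pos)
            if match:
                candidates.append((name, match.group()))
        if not candidates:
            raise SyntaxError(f"no legal terminal matches at position {pos}")
        # Among the legal terminals, keep the longest match.
        name, value = max(candidates, key=lambda c: len(c[1]))
        tokens.append((name, value))
        pos += len(value)
        prev = name
    return tokens


print(contextual_tokens("1orfoo"))
# prints [('DIGIT', '1'), ('OR', 'or'), ('ID', 'foo')]
```

After the DIGIT token, ID is not among the candidates, so nothing competes with the two-character `or` match even though an identifier pattern would have consumed all of `orfoo`.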

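To change this behavior, you can select the standard lexer instead, which matches against all terminals regardless of the parser state (Lark 1.0 and later name this lexer 'basic'):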

parser = Lark('''...''', parser='lalr', lexer='standard')

With your example, orfoo is now lexed as a single ID token, and the parser rejects it:

lark.exceptions.UnexpectedToken: Unexpected token Token(ID, 'orfoo') at line 1, column 2.
Expected one of: 
    * OR
    * $END
