如何解析语法`(a | b)* a`

How to parse grammar `(a | b)* a`

让我们考虑以下 Backus-Naur 形式描述的语法:

a ::= 'a'
b ::= 'b'
grammar ::= (a | b)* a

我正在尝试使用 pyparsing 对其进行解析并得出以下实现

a = Literal('a')
b = Literal('b')
grammar = (a | b)[...] + 'a'

但是,它无法解析语法描述的任何字符串,例如 grammar.parseString('aba') raising

ParseException: Expected "a", found end of text  (at char 3), (line:1, col:4)

这似乎是由于 [...] 表达式通过消耗令牌来解析,直到无法再执行为止。那么,最后一个文字就没有要解析的标记了。

一种方法是使用 FollowedBy class:

grammar = ((a | b) + FollowedBy(a | b))[...] + a

有效。然而,它非常不优雅,似乎效率不高,而且不是真正的通用。

有没有更好的方法用 pyparsing 解析这个语法?

不,你是完全正确的,pyparsing 不会像 "[ab]*a" 这样的正则表达式那样进行回溯。除非您明确实施,否则 Pyparsing 不会进行任何形式的前瞻。

这是原始解析器的扩展版本,添加了 setNamesetDebug 调用:

a = Literal('a').setName("A").setDebug()
b = Literal('b').setName("B").setDebug()
grammar = (a | b)[...] + a().setName("trailing_a").setDebug()

grammar.runTests("""\
    aba
    """)

解析“aba”时,调试输出如下:

Match A at loc 0(1,1)
Matched A -> ['a']
Match A at loc 1(1,2)
Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
Match B at loc 1(1,2)
Matched B -> ['b']
Match A at loc 2(1,3)
Matched A -> ['a']
Match A at loc 3(1,4)
Exception raised:Expected A, found end of text  (at char 3), (line:1, col:4)
Match B at loc 3(1,4)
Exception raised:Expected B, found end of text  (at char 3), (line:1, col:4)
Match trailing_a at loc 3(1,4)
Exception raised:Expected trailing_a, found end of text  (at char 3), (line:1, col:4)

aba
   ^
FAIL: Expected trailing_a, found end of text  (at char 3), (line:1, col:4

您可以看到 trailing_a 作为初始重复的一部分得到匹配,而不是作为 trailing_a。由于现在没有实际的尾随 'a',因此解析失败。

您可以为前导重复定义一种特殊形式的 a(就像您在一行中所做的那样),就像在两行中这样:

leading_a = a + FollowedBy(a | b)
grammar = (leading_a | b)[...] + 'a'

通过调试输出,我们可以遵循解析器逻辑:

Match leading_a at loc 0(1,1)
Match A at loc 0(1,1)
Matched A -> ['a']
Match A at loc 1(1,2)
Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
Match B at loc 1(1,2)
Matched B -> ['b']
Matched leading_a -> ['a']
Match leading_a at loc 1(1,2)
Match A at loc 1(1,2)
Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
Match B at loc 1(1,2)
Matched B -> ['b']
Match leading_a at loc 2(1,3)
Match A at loc 2(1,3)
Matched A -> ['a']
Match A at loc 3(1,4)
Exception raised:Expected A, found end of text  (at char 3), (line:1, col:4)
Match B at loc 3(1,4)
Exception raised:Expected B, found end of text  (at char 3), (line:1, col:4)
Exception raised:Expected {A | B}, found end of text  (at char 3), (line:1, col:4)
Match B at loc 2(1,3)
Exception raised:Expected B, found 'a'  (at char 2), (line:1, col:3)

aba
['a', 'b', 'a']

或者定义一个特殊的trailing_a,并使用stopOn参数给ZeroOrMore:

trailing_a = a + ~FollowedBy(a | b)
grammar = OneOrMore(a | b, stopOn=trailing_a) + 'a'

得到相似的结果。

编辑 将语法更改为 (a | b)[...] 会显示此调试输出:

Match A at loc 0(1,1)
Matched A -> ['a']
Match A at loc 1(1,2)
Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
Match B at loc 1(1,2)
Matched B -> ['b']
Match A at loc 2(1,3)
Matched A -> ['a']
Match A at loc 3(1,4)
Exception raised:Expected A, found end of text  (at char 3), (line:1, col:4)
Match B at loc 3(1,4)
Exception raised:Expected B, found end of text  (at char 3), (line:1, col:4)

aba
['a', 'b', 'a']

所以,是的,前瞻确实会导致性能下降。

pyparsing 包括内部缓存功能,也称为“packrat 解析”。这是调试输出,缓存值标有“*”:

  Match trailing_a at loc 0(1,1)
  Match A at loc 0(1,1)
  Matched A -> ['a']
  Match A at loc 1(1,2)
  Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
  Match B at loc 1(1,2)
  Matched B -> ['b']
  Exception raised:Found unwanted token, FollowedBy:({A | B}), found 'b'  (at char 1), (line:1, col:2)
* Match A at loc 0(1,1)
* Matched A -> ['a']
  Match trailing_a at loc 1(1,2)
* Match A at loc 1(1,2)
* Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
* Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
* Match A at loc 1(1,2)
* Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
* Match B at loc 1(1,2)
* Matched B -> ['b']
  Match trailing_a at loc 2(1,3)
  Match A at loc 2(1,3)
* Matched A -> ['a']
  Match A at loc 3(1,4)
  Exception raised:Expected A, found end of text  (at char 3), (line:1, col:4)
  Match B at loc 3(1,4)
  Exception raised:Expected B, found end of text  (at char 3), (line:1, col:4)
  Matched trailing_a -> ['a']
* Match trailing_a at loc 2(1,3)
* Match A at loc 2(1,3)
* Matched A -> ['a']
* Matched trailing_a -> ['a']

“匹配”操作总结:

  • 没有前瞻:6
  • 前瞻:12
  • 使用 lookahead+packrat:9

最后一点,可以使用解析操作在解析时注入额外的逻辑。解析动作可以写成一个方法(可以 return 修改后的标记集,或引发异常)或作为谓词函数(returns True 或 False,pyparsing 将引发异常如果出现 False return).

因此您可以使用最快的 no-lookahead 形式编写语法,然后添加验证条件为 运行:

grammar = (a | b)[...]
grammar.addCondition(lambda t: t[-1] == 'a', message="string does not end with 'a'")

很可能节省的解析时间足以抵消进行单独条件评估的额外成本。