PLY: parsing line-oriented grammar
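I need to parse a (relatively) simple line-oriented language (I did not invent the language itself; it is the definition language for PlantUML diagrams).
My test input is simple: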
@startuml
Alice -> Bob: Authentication Request
Bob --> Alice: Authentication Response
Alice -> Bob: Another authentication Request
Alice <-- Bob: another authentication Response
@enduml
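The problem arises because anything after a colon (':') should be treated as a (possibly escaped) string up to the first EOL ('\n'), completely ignoring any punctuation it may contain.
Note: for simplicity, what follows is only an excerpt of the grammar; I can post the complete test program if that would be useful.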
tokens = (
    'BEGIN', 'END', 'START', 'STATE', 'RARROW2', 'RARROW1', 'LARROW2', 'LARROW1',
    'IDENT', 'COLON', 'NUMBER', 'BSCRIPT', 'ESCRIPT', 'ENDLINE', 'FULLINE', 'newline'
)
literals = '{:}'

t_BEGIN = r"@startuml"
t_END = r"@enduml"
t_START = r"\[\*\]"
t_RARROW2 = r"-->"
t_RARROW1 = r"->"
t_LARROW2 = r"<--"
t_LARROW1 = r"<-"
t_BSCRIPT = r"/'--"
t_ESCRIPT = r"--'/"
t_ENDLINE = r'.+'
t_FULLINE = r'^.*\n'

def t_IDENT(t):
    r"""[a-zA-Z_][a-zA-Z0-9_]*"""
    return t

t_ignore = " \t"

def t_newline(t):
    r"""\n+"""
    t.lexer.lineno += t.value.count("\n")
    return t

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

def p_diagram(p):
    """diagram : begin diags end"""

def p_begin(p):
    """begin : BEGIN newline"""

def p_end(p):
    """end : END newline"""

def p_diags1(p):
    """diags : diag"""

def p_diags2(p):
    """diags : diags diag"""

def p_diag_t(p):
    """diag : tranc"""

def p_tranc1(p):
    """tranc : trans newline"""

def p_tranc2(p):
    """tranc : trans ':' ENDLINE newline"""

def p_transr(p):
    """trans : node rarrow node"""

def p_transl(p):
    """trans : node larrow node"""

def p_node(p):
    """node : IDENT
            | START"""

def p_rarrow(p):
    """rarrow : RARROW1
              | RARROW2"""
    p[0] = p[1]
    print("rarrow : (%s)" % p[1])

def p_larrow(p):
    """larrow : LARROW1
              | LARROW2"""
The first error I get is: Syntax error at ': Authentication Request'
The parser debug output is:
yacc.py: 360:PLY: PARSE DEBUG START
yacc.py: 408:
yacc.py: 409:State : 0
yacc.py: 433:Stack : . LexToken(BEGIN,'@startuml',1,0)
yacc.py: 443:Action : Shift and goto state 2
yacc.py: 408:
yacc.py: 409:State : 2
yacc.py: 433:Stack : BEGIN . LexToken(newline,'\n',1,9)
yacc.py: 443:Action : Shift and goto state 11
yacc.py: 408:
yacc.py: 409:State : 11
yacc.py: 433:Stack : BEGIN newline . LexToken(IDENT,'Alice',2,10)
yacc.py: 469:Action : Reduce rule [begin -> BEGIN newline] with ['@startuml','\n'] and goto state 1
yacc.py: 504:Result : <NoneType @ 0x5584868800e0> (None)
yacc.py: 408:
yacc.py: 409:State : 1
yacc.py: 433:Stack : begin . LexToken(IDENT,'Alice',2,10)
yacc.py: 443:Action : Shift and goto state 8
yacc.py: 408:
yacc.py: 409:State : 8
yacc.py: 433:Stack : begin IDENT . LexToken(RARROW1,'->',2,16)
yacc.py: 469:Action : Reduce rule [node -> IDENT] with ['Alice'] and goto state 10
yacc.py: 504:Result : <Node @ 0x7fa389dae9e8> ([[Alice]])
yacc.py: 408:
yacc.py: 409:State : 10
yacc.py: 433:Stack : begin node . LexToken(RARROW1,'->',2,16)
yacc.py: 443:Action : Shift and goto state 20
yacc.py: 408:
yacc.py: 409:State : 20
yacc.py: 433:Stack : begin node RARROW1 . LexToken(IDENT,'Bob',2,19)
yacc.py: 469:Action : Reduce rule [rarrow -> RARROW1] with ['->'] and goto state 22
yacc.py: 504:Result : <str @ 0x7fa389daea78> ('->')
yacc.py: 408:
yacc.py: 409:State : 22
yacc.py: 433:Stack : begin node rarrow . LexToken(IDENT,'Bob',2,19)
yacc.py: 443:Action : Shift and goto state 8
yacc.py: 408:
yacc.py: 409:State : 8
yacc.py: 433:Stack : begin node rarrow IDENT . LexToken(ENDLINE,': Authentication Request',2,22)
yacc.py: 578:Error : begin node rarrow IDENT . LexToken(ENDLINE,': Authentication Request',2,22)
yacc.py: 408:
yacc.py: 409:State : 8
yacc.py: 433:Stack : begin node rarrow IDENT . LexToken(newline,'\n',2,46)
yacc.py: 469:Action : Reduce rule [node -> IDENT] with ['Bob'] and goto state 26
yacc.py: 504:Result : <Node @ 0x7fa389daeb00> ([[Bob]])
yacc.py: 408:
yacc.py: 409:State : 26
yacc.py: 433:Stack : begin node rarrow node . LexToken(newline,'\n',2,46)
yacc.py: 469:Action : Reduce rule [trans -> node rarrow node] with [[[Alice]],'->',[[Bob]]] and goto state 9
yacc.py: 504:Result : <Trans @ 0x7fa389daea58> ([[Alice]] --> [[Bob]])
yacc.py: 408:
yacc.py: 409:State : 9
yacc.py: 433:Stack : begin trans . LexToken(newline,'\n',2,46)
yacc.py: 443:Action : Shift and goto state 16
yacc.py: 408:
yacc.py: 409:State : 16
yacc.py: 433:Stack : begin trans newline . LexToken(IDENT,'Bob',3,47)
yacc.py: 469:Action : Reduce rule [tranc -> trans newline] with [<Trans @ 0x7fa389daea58>,'\n'] and goto state 4
yacc.py: 504:Result : <Trans @ 0x7fa389daea58> ([[Alice]] --> [[Bob]])
yacc.py: 408:
yacc.py: 409:State : 4
yacc.py: 433:Stack : begin tranc . LexToken(IDENT,'Bob',3,47)
yacc.py: 469:Action : Reduce rule [diag -> tranc] with [<Trans @ 0x7fa389daea58>] and goto state 5
yacc.py: 504:Result : <Trans @ 0x7fa389daea58> ([[Alice]] --> [[Bob]])
yacc.py: 408:
yacc.py: 409:State : 5
yacc.py: 433:Stack : begin diag . LexToken(IDENT,'Bob',3,47)
yacc.py: 469:Action : Reduce rule [diags -> diag] with [<Trans @ 0x7fa389daea58>] and goto state 6
yacc.py: 504:Result : <list @ 0x7fa389db3ac8> ([[[Alice]] --> [[Bob]]])
yacc.py: 408:
yacc.py: 409:State : 6
yacc.py: 433:Stack : begin diags . LexToken(IDENT,'Bob',3,47)
yacc.py: 443:Action : Shift and goto state 8
yacc.py: 408:
yacc.py: 409:State : 8
yacc.py: 433:Stack : begin diags IDENT . LexToken(RARROW2,'-->',3,51)
yacc.py: 469:Action : Reduce rule [node -> IDENT] with ['Bob'] and goto state 10
yacc.py: 504:Result : <Node @ 0x7fa389daeb00> ([[Bob]])
yacc.py: 408:
yacc.py: 409:State : 10
yacc.py: 433:Stack : begin diags node . LexToken(RARROW2,'-->',3,51)
yacc.py: 443:Action : Shift and goto state 21
yacc.py: 408:
yacc.py: 409:State : 21
yacc.py: 433:Stack : begin diags node RARROW2 . LexToken(IDENT,'Alice',3,55)
yacc.py: 469:Action : Reduce rule [rarrow -> RARROW2] with ['-->'] and goto state 22
yacc.py: 504:Result : <str @ 0x7fa389daeb90> ('-->')
yacc.py: 408:
yacc.py: 409:State : 22
yacc.py: 433:Stack : begin diags node rarrow . LexToken(IDENT,'Alice',3,55)
yacc.py: 443:Action : Shift and goto state 8
yacc.py: 408:
yacc.py: 409:State : 8
yacc.py: 433:Stack : begin diags node rarrow IDENT . LexToken(ENDLINE,': Authentication Response',3,60)
yacc.py: 578:Error : begin diags node rarrow IDENT . LexToken(ENDLINE,': Authentication Response',3,60)
yacc.py: 408:
yacc.py: 409:State : 8
yacc.py: 433:Stack : begin diags node rarrow IDENT . LexToken(newline,'\n',3,85)
yacc.py: 469:Action : Reduce rule [node -> IDENT] with ['Alice'] and goto state 26
yacc.py: 504:Result : <Node @ 0x7fa389dae9e8> ([[Alice]])
yacc.py: 408:
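As you can see, the token after the second IDENT ('Bob') is an ENDLINE (': Authentication Request') that includes the colon as its first character, which throws the parser off completely.
What is the recommended fix?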
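That this lexer works at all is a consequence of the peculiar order in which Ply applies lexical rules. [Note 1]
Lexical analysis is simplest when the input can be split into a sequence of lexemes, each of which can be identified without taking any previous lexeme into account. That is the default model for any tokeniser framework. In that model, a lexical pattern defined as "anything up to the end of the line" always applies, which means your input will be analysed as newlines and rest-of-lines. That is probably not what you want.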
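To make the "newlines and rest-of-lines" outcome concrete, here is a minimal sketch that uses only Python's standard `re` module, not Ply itself. The two-rule table and the `tokenize` helper are hypothetical; they only assume a tokeniser in which the first rule that matches at the current position wins.

```python
import re

# Hypothetical two-rule lexer: a catch-all rule and a newline rule.
RULES = [("ENDLINE", r".+"), ("newline", r"\n+")]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in RULES))


def tokenize(text):
    """Split text into (rule name, matched text) pairs; the first rule to match wins."""
    tokens, pos = [], 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"no rule matches at offset {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


sample = "@startuml\nAlice -> Bob: Authentication Request\n@enduml\n"
for token in tokenize(sample):
    print(token)
```

Every line comes out as a single ENDLINE token followed by a newline token, so identifiers and arrows are never seen as tokens of their own.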
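It looks like the lexeme is actually "a colon followed by the rest of the line", so there is no point in separating the colon and the rest of the line into two tokens. If that is really the case, the pattern is easy to write: r':.*'. (That won't work if colons are used for other purposes elsewhere. Hopefully they aren't.)
If you would rather have the colon kept out of the value of the matched token, as if the colon and the rest of the line were two separate tokens, you can get the same effect by modifying t.value inside the :.* token function.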
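A minimal sketch of such a token function follows. The function is written the way Ply token functions are (regex in the docstring, token passed in as `t`), but to keep the snippet runnable without Ply it is exercised with a stand-in token object; the `SimpleNamespace` stand-in and the choice to strip surrounding blanks are assumptions, not something the question specifies.

```python
import re
from types import SimpleNamespace


def t_ENDLINE(t):
    r":.*"
    # The pattern matches the colon and everything up to, but not including,
    # the newline. Keep only the message text as the token value.
    t.value = t.value[1:].strip()
    return t


# Stand-in for the token object a lexer would pass in (assumption: only
# .type and .value matter for this illustration).
line = "Alice -> Bob: Authentication Request"
matched = re.compile(t_ENDLINE.__doc__).search(line).group()
token = t_ENDLINE(SimpleNamespace(type="ENDLINE", value=matched))
print(repr(matched), "->", repr(token.value))
```

With the colon folded into ENDLINE, the grammar rule would presumably become `tranc : trans ENDLINE newline`, without the ':' literal. Because it is a function, this rule would also be tried before any of the string-variable patterns.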
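Notes:
1. Ply checks patterns in the following order:
- First, patterns from token functions, in the order in which the functions are defined in the file.
- Second, patterns from token variables, in reverse order of length (i.e. from longest to shortest).
Since the pattern .* is longer than the pattern :, it is tried first, so the colon is never recognised. That -> is matched before .* is, I believe, pure luck: the sort by length cannot be relied on for patterns of the same length.
On the whole, it is better to use one of the following strategies:
- Use only token functions, and put them in the correct order by hand.
- Use token variables only for unambiguous patterns.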
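The ordering described in the note can be imitated with plain `re`, which makes the "pure luck" part visible. This is a sketch of the ordering only, not Ply's actual code: `ply_like_order` and `tokenize` are hypothetical helpers, the rule set is a subset of the one in the question (which uses `.+` as the catch-all), and how ties between equal-length patterns are broken here (list order) is an assumption.

```python
import re

IDENT = ("IDENT", r"[a-zA-Z_][a-zA-Z0-9_]*")
NEWLINE = ("newline", r"\n+")
ARROWS = [("RARROW2", r"-->"), ("RARROW1", r"->"),
          ("LARROW2", r"<--"), ("LARROW1", r"<-")]
CATCH_ALL = ("ENDLINE", r".+")


def ply_like_order(function_rules, string_rules):
    """Functions keep definition order; string patterns go longest regex first."""
    # sorted() is stable, so equal-length patterns keep their list order.
    by_length = sorted(string_rules, key=lambda rule: len(rule[1]), reverse=True)
    return list(function_rules) + by_length


def tokenize(text, rules, ignore=" \t"):
    """Return (rule name, text) pairs; the first alternative that matches wins."""
    master = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in rules))
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos] in ignore:
            pos += 1
            continue
        match = master.match(text, pos)
        if match is None:
            raise SyntaxError(f"no rule matches at offset {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


line = "Alice -> Bob: Authentication Request\n"

# Rules as posted: '->' and '.+' have the same length, and '->' happens to come first.
as_posted = tokenize(line, ply_like_order([IDENT, NEWLINE], ARROWS + [CATCH_ALL]))

# Same rules, but the equal-length tie falls the other way: '.+' swallows the arrow.
unlucky = tokenize(line, ply_like_order([IDENT, NEWLINE], [CATCH_ALL] + ARROWS))

# First strategy: every rule is function-like and ordered by hand, longest arrows first.
by_hand = [IDENT, NEWLINE, ("ENDLINE", r":.*"),
           ARROWS[0], ARROWS[2], ARROWS[1], ARROWS[3]]
explicit = tokenize(line, by_hand)

for label, result in [("as posted", as_posted), ("unlucky", unlucky), ("explicit", explicit)]:
    print(label, result)
```

The "as posted" run reproduces the ENDLINE token with a leading colon seen in the debug output, the "unlucky" run loses the arrow entirely, and the hand-ordered run yields the same tokens as the first without depending on how ties are broken.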