lark-parser indented DSL and multiline documentation strings
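I'm trying to implement a record definition DSL using lark. It is indentation-based, which makes things a bit more complex.
Lark is a great tool, but I'm running into some difficulties.
Here is a snippet of the DSL I'm implementing: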
record Order :
    """Order record documentation
    should have arbitrary size"""
    field1 Int
    field2 Datetime:
        """Attributes should also have
        multiline documentation"""
    field3 String "inline documentation also works"
And here is the grammar I'm using:
?start: (_NEWLINE | redorddef)*
simple_type: NAME
multiline_doc: MULTILINE_STRING _NEWLINE
inline_doc: INLINE_STRING
?element_doc: ":" _NEWLINE _INDENT multiline_doc _DEDENT | inline_doc
attribute_name: NAME
attribute_simple_type: attribute_name simple_type [element_doc] _NEWLINE
attributes: attribute_simple_type+
_recordbody: _NEWLINE _INDENT [multiline_doc] attributes _DEDENT
redorddef: "record" NAME ":" _recordbody
MULTILINE_STRING: /"""([^"\\]*(\\.[^"\\]*)*)"""/
INLINE_STRING: /"([^"\\]*(\\.[^"\\]*)*)"/
_WS_INLINE: (" "|/\t/)+
COMMENT: /#[^\n]*/
_NEWLINE: ( /\r?\n[\t ]*/ | COMMENT )+
%import common.CNAME -> NAME
%import common.INT
%ignore /[\t \f]+/ // WS
%ignore /\\[\t \f]*\r?\n/ // LINE_CONT
%ignore COMMENT
%declare _INDENT _DEDENT
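It works for multiline docstrings on the record definition and for inline documentation on attributes, but not for multiline docstrings on attributes.
The code I use to run it is this: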
import sys
import pprint
from pathlib import Path
from lark import Lark, UnexpectedInput
from lark.indenter import Indenter

scheman_data_works = '''
record Order :
    """Order record documentation
    should have arbitrary size"""
    field1 Int
    # field2 Datetime:
    #     """Attributes should also have
    #     multiline documentation"""
    field3 String "inline documentation also works"
'''

scheman_data_wrong = '''
record Order :
    """Order record documentation
    should have arbitrary size"""
    field1 Int
    field2 Datetime:
        """Attributes should also have
        multiline documentation"""
    field3 String "inline documentation also works"
'''

grammar = r'''
?start: (_NEWLINE | redorddef)*
simple_type: NAME
multiline_doc: MULTILINE_STRING _NEWLINE
inline_doc: INLINE_STRING
?element_doc: ":" _NEWLINE _INDENT multiline_doc _DEDENT | inline_doc
attribute_name: NAME
attribute_simple_type: attribute_name simple_type [element_doc] _NEWLINE
attributes: attribute_simple_type+
_recordbody: _NEWLINE _INDENT [multiline_doc] attributes _DEDENT
redorddef: "record" NAME ":" _recordbody
MULTILINE_STRING: /"""([^"\\]*(\\.[^"\\]*)*)"""/
INLINE_STRING: /"([^"\\]*(\\.[^"\\]*)*)"/
_WS_INLINE: (" "|/\t/)+
COMMENT: /#[^\n]*/
_NEWLINE: ( /\r?\n[\t ]*/ | COMMENT )+
%import common.CNAME -> NAME
%import common.INT
%ignore /[\t \f]+/ // WS
%ignore /\\[\t \f]*\r?\n/ // LINE_CONT
%ignore COMMENT
%declare _INDENT _DEDENT
'''

class SchemanIndenter(Indenter):
    NL_type = '_NEWLINE'
    OPEN_PAREN_types = ['LPAR', 'LSQB', 'LBRACE']
    CLOSE_PAREN_types = ['RPAR', 'RSQB', 'RBRACE']
    INDENT_type = '_INDENT'
    DEDENT_type = '_DEDENT'
    tab_len = 4

scheman_parser = Lark(grammar, parser='lalr', postlex=SchemanIndenter())
print(scheman_parser.parse(scheman_data_works).pretty())
print("\n\n")
print(scheman_parser.parse(scheman_data_wrong).pretty())
The result is:
redorddef
  Order
  multiline_doc    """Order record documentation
    should have arbitrary size"""
  attributes
    attribute_simple_type
      attribute_name    field1
      simple_type    Int
    attribute_simple_type
      attribute_name    field3
      simple_type    String
      inline_doc    "inline documentation also works"
Traceback (most recent call last):
File "schema_parser.py", line 83, in <module>
print(scheman_parser.parse(scheman_data_wrong).pretty())
File "/Users/branquif/Dropbox/swf_projects/schema-manager/.venv/lib/python3.7/site-packages/lark/lark.py", line 228, in parse
return self.parser.parse(text)
File "/Users/branquif/Dropbox/swf_projects/schema-manager/.venv/lib/python3.7/site-packages/lark/parser_frontends.py", line 38, in parse
return self.parser.parse(token_stream, *[sps] if sps is not NotImplemented else [])
File "/Users/branquif/Dropbox/swf_projects/schema-manager/.venv/lib/python3.7/site-packages/lark/parsers/lalr_parser.py", line 68, in parse
for token in stream:
File "/Users/branquif/Dropbox/swf_projects/schema-manager/.venv/lib/python3.7/site-packages/lark/indenter.py", line 31, in process
for token in stream:
File "/Users/branquif/Dropbox/swf_projects/schema-manager/.venv/lib/python3.7/site-packages/lark/lexer.py", line 319, in lex
for x in l.lex(stream, self.root_lexer.newline_types, self.root_lexer.ignore_types):
File "/Users/branquif/Dropbox/swf_projects/schema-manager/.venv/lib/python3.7/site-packages/lark/lexer.py", line 167, in lex
raise UnexpectedCharacters(stream, line_ctr.char_pos, line_ctr.line, line_ctr.column, state=self.state)
lark.exceptions.UnexpectedCharacters: No terminal defined for 'f' at line 11 col 2
field3 String "inline documentation also
^
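I understand that indentation-based grammars are more complex, and lark seems to make them easier, but I can't find the mistake here.
PS: I also tried pyparsing, without success in this same scenario, and it would be hard for me to move to PLY given the amount of code it would probably require.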
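The bug comes from a misplaced _NEWLINE terminal. In general, it is a good idea to keep rules balanced in terms of the role they play in the grammar. So here is how element_doc should have been defined: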
?element_doc: ":" _NEWLINE _INDENT multiline_doc _DEDENT
            | inline_doc _NEWLINE
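Notice the added newline: whichever of the two alternatives the parser takes, it now ends up in a similar state, syntax-wise (the _DEDENT is emitted right after a newline has been consumed, so that branch has already moved past the line break too).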
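To see why the trailing `_NEWLINE` was misplaced, it can help to look at the token stream the parser receives. The following is a simplified, stdlib-only imitation of an indentation post-lexer. It is not Lark's implementation: `TOKEN_SPEC` and `token_types` are made-up names, the sample's indentation is assumed, and `record` is lexed as a plain `NAME` for brevity.

```python
import re

# Hypothetical token table, loosely mirroring the terminals in the grammar.
TOKEN_SPEC = [
    ("MULTILINE_STRING", r'"""(?:[^"\\]|\\.)*"""'),
    ("INLINE_STRING", r'"(?:[^"\\]|\\.)*"'),
    ("_NEWLINE", r'(?:\r?\n[\t ]*)+'),
    ("COLON", r':'),
    ("NAME", r'[A-Za-z_]\w*'),
    ("WS", r'[ \t]+'),
]
TOKEN_RE = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))


def token_types(text, tab_len=4):
    """Return token type names, adding _INDENT/_DEDENT after each _NEWLINE."""
    levels = [0]          # stack of active indentation widths
    out = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        pos = m.end()
        kind = m.lastgroup
        if kind == "WS":
            continue      # inline whitespace is ignored
        out.append(kind)
        if kind == "_NEWLINE":
            # Indentation is whatever follows the last line break in the token.
            indent_str = m.group().rsplit("\n", 1)[1]
            indent = indent_str.count(" ") + indent_str.count("\t") * tab_len
            if indent > levels[-1]:
                levels.append(indent)
                out.append("_INDENT")
            while indent < levels[-1]:
                levels.pop()
                out.append("_DEDENT")
    return out


sample = '''
record Order :
    """Order record documentation
    should have arbitrary size"""
    field1 Int
    field2 Datetime:
        """Attributes should also have
        multiline documentation"""
    field3 String "inline documentation also works"
'''

tokens = token_types(sample)
print(" ".join(tokens))
```

In this sketch, the attribute's multiline docstring is followed by `_NEWLINE _DEDENT NAME`: `multiline_doc` consumes the `_NEWLINE`, `element_doc` consumes the `_DEDENT`, and no `_NEWLINE` is left for `attribute_simple_type` to match before the next field name.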
The second change follows from the first:
attribute_simple_type: attribute_name simple_type (element_doc|_NEWLINE)
Since element_doc already handles the newline, we shouldn't try to match it a second time.
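You mentioned trying pyparsing, otherwise I would have left your question alone.
Whitespace-sensitive parsing is not pyparsing's strong suit, but it does make an effort for this kind of case with pyparsing.indentedBlock. Writing this takes a certain amount of pain, but it can be done.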
import pyparsing as pp

COLON = pp.Suppress(':')
tpl_quoted_string = pp.QuotedString('"""', multiline=True) | pp.QuotedString("'''", multiline=True)
quoted_string = pp.ungroup(tpl_quoted_string | pp.quotedString().addParseAction(pp.removeQuotes))
RECORD = pp.Keyword("record")
ident = pp.pyparsing_common.identifier()

field_expr = (ident("name")
              + ident("type") + pp.Optional(COLON)
              + pp.Optional(quoted_string)("docstring"))

# indentedBlock tracks nesting through this shared stack; reset it at the
# start of each record so the block is measured from the record's column.
indent_stack = []
STACK_RESET = pp.Empty()
def reset_indent_stack(s, l, t):
    indent_stack[:] = [pp.col(l, s)]
STACK_RESET.addParseAction(reset_indent_stack)

record_expr = pp.Group(STACK_RESET
                       + RECORD - ident("name") + COLON + pp.Optional(quoted_string)("docstring")
                       + (pp.indentedBlock(field_expr, indent_stack))("fields"))
record_expr.ignore(pp.pythonStyleComment)
With your example stored in a variable 'sample', run:
print(record_expr.parseString(sample).dump())
And you get:
[['record', 'Order', 'Order record documentation\n should have arbitrary size', [['field1', 'Int'], ['field2', 'Datetime', 'Attributes should also have\n multiline documentation'], ['field3', 'String', 'inline documentation also works']]]]
[0]:
['record', 'Order', 'Order record documentation\n should have arbitrary size', [['field1', 'Int'], ['field2', 'Datetime', 'Attributes should also have\n multiline documentation'], ['field3', 'String', 'inline documentation also works']]]
- docstring: 'Order record documentation\n should have arbitrary size'
- fields: [['field1', 'Int'], ['field2', 'Datetime', 'Attributes should also have\n multiline documentation'], ['field3', 'String', 'inline documentation also works']]
[0]:
['field1', 'Int']
- name: 'field1'
- type: 'Int'
[1]:
['field2', 'Datetime', 'Attributes should also have\n multiline documentation']
- docstring: 'Attributes should also have\n multiline documentation'
- name: 'field2'
- type: 'Datetime'
[2]:
['field3', 'String', 'inline documentation also works']
- docstring: 'inline documentation also works'
- name: 'field3'
- type: 'String'
- name: 'Order'