Node depth encoded as number of stars
Documents in this language look like:
* A top-level Headline
Some text about that headline.
** Sub-Topic 1
Text about the sub-topic 1.
*** Sub-sub-topic
More text here about the sub-sub-topic
** Sub-Topic 2
Extra text here about sub-topic 2
*** Other Sub-sub-topic
More text here about the other sub-sub-topic
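The number of depth levels is unlimited. I'd like to know how to get a parser that builds the nested tree properly. I've been looking at the indenter example for inspiration, but I haven't been able to figure it out.
The problem requires a context-sensitive grammar, so we use the same workaround as the indenter example you linked:
We write a custom postlex processor that keeps a stack of the observed indentation levels. When a star token (*, **, ***, ...) is read, the stack is popped until the level on top of the stack is smaller than the new one, and then the new level is pushed onto the stack. For each level pushed or popped, a corresponding INDENT/DEDENT helper token is injected into the token stream. These helper tokens can then be used in the grammar to obtain a parse tree that reflects the nesting level.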
from lark import Lark, Token

tree_grammar = r"""
    start: NEWLINE* item*
    item: STARS nest
    nest: _INDENT (nest | LINE+ item*) _DEDENT

    STARS.2: /\*+/
    LINE.1: /.*/ NEWLINE

    %declare _INDENT _DEDENT
    %import common.NEWLINE
"""

class StarIndenter():
    STARS_type = 'STARS'
    INDENT_type = '_INDENT'
    DEDENT_type = '_DEDENT'

    def dedent(self, level, token):
        """ When the given level leaves the current nesting of the stack,
            inject corresponding number of DEDENT tokens into the stream.
        """
        while level <= self.indent[-1]:
            pop_level = self.indent.pop()
            pop_diff = pop_level - self.indent[-1]
            for _ in range(pop_diff):
                yield token

    def handle_stars(self, token):
        """ Handle tokens of the form '*', '**', '***', ...
        """
        level = len(token.value)
        dedent_token = Token.new_borrow_pos(self.DEDENT_type, '', token)
        yield from self.dedent(level, dedent_token)
        diff = level - self.indent[-1]
        self.indent.append(level)
        # Put star token into stream
        yield token
        indent_token = Token.new_borrow_pos(self.INDENT_type, '', token)
        for _ in range(diff):
            yield indent_token

    def process(self, stream):
        self.indent = [0]
        # Process token stream
        for token in stream:
            if token.type == self.STARS_type:
                yield from self.handle_stars(token)
            else:
                yield token
        # Inject closing dedent tokens
        yield from self.dedent(1, Token(self.DEDENT_type, ''))

    # Lark's postlex interface expects this attribute: the terminals the
    # contextual lexer should always accept. None are needed here.
    @property
    def always_accept(self):
        return ()

parser = Lark(tree_grammar, parser='lalr', postlex=StarIndenter())
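To see what the postlexer injects without running Lark at all, the same stack bookkeeping can be traced on plain star counts. This is only a sketch: inject_markers is a hypothetical helper that is not part of the answer's code or of Lark, and it uses the strings "INDENT"/"DEDENT" in place of real tokens.

```python
def inject_markers(levels):
    """Return the marker stream for a sequence of headline depths.

    Hypothetical stand-in for StarIndenter.process: `levels` holds the
    star count of each headline, in document order.
    """
    stack = [0]  # sentinel level, never popped
    out = []

    def close_down_to(level):
        # Pop every open level >= `level`, emitting one DEDENT per
        # depth step that is being closed.
        while level <= stack[-1]:
            popped = stack.pop()
            out.extend(["DEDENT"] * (popped - stack[-1]))

    for level in levels:
        close_down_to(level)
        diff = level - stack[-1]
        stack.append(level)
        out.append("*" * level)           # the star token itself
        out.extend(["INDENT"] * diff)     # one INDENT per depth step opened

    close_down_to(1)  # close everything still open at end of input
    return out


markers = inject_markers([1, 2, 3, 2, 3])
print(" ".join(markers))
```

Every INDENT is eventually matched by a DEDENT, including when a headline skips a level (for example * followed directly by ***), which is the case the nest alternative inside the nest rule is there to absorb.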
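Note that the STARS terminal is assigned a higher priority than LINE (via .2 vs. .1), to prevent LINE+ from consuming lines that start with a star.
Using a stripped-down version of your example: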
test_tree = """
* A
** AA
*** AAA
** AB
*** ABA
"""
print(parser.parse(test_tree).pretty())
Result:
start
  item
    *
    nest
      A
      item
        **
        nest
          AA
          item
            ***
            nest AAA
      item
        **
        nest
          AB
          item
            ***
            nest ABA
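As a cross-check on the tree shape above, the same pop-then-push idea can be applied directly to the lines of a document with only the standard library. This is a minimal sketch and not the Lark-based solution: build_outline and its dict layout are hypothetical names chosen for illustration.

```python
import re


def build_outline(text):
    """Parse star-prefixed headlines into nested dicts (hypothetical helper)."""
    root = {"level": 0, "title": None, "body": [], "children": []}
    stack = [root]  # path from the root to the most recent headline
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = re.match(r"(\*+)\s*(.*)", line)
        if match:
            level = len(match.group(1))
            # Pop until the node on top is shallower than the new headline.
            while stack[-1]["level"] >= level:
                stack.pop()
            node = {"level": level, "title": match.group(2),
                    "body": [], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            # Plain text belongs to the most recent headline.
            stack[-1]["body"].append(line)
    return root


outline = build_outline("* A\n** AA\n*** AAA\n** AB\n*** ABA\n")
top = outline["children"][0]
print(top["title"], [child["title"] for child in top["children"]])
```

Here AA and AB end up as siblings under A, matching the two item nodes that share the outer nest in the parse tree.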