Parsing elements/fields without end marker, problems with non-greedy regex, usage of a custom lexer

I would like to parse files written in the Textile markup language (https://textile-lang.com/) so that I can convert them to LaTeX. The files I have are a slight extension of Textile, in that they add fields and footnotes. An example file is given below.

test.textile

#[contents]#
p. This is a paragraph.

With some bullet points:

* Bullet 1
* Bullet 2
* Bullet 3

And a code block:

bc.. # Program to display the Fibonacci sequence up to n-th term

"""\
* Program to display the Fibonacci sequence up to n-th term
"""

search_string = r"<5>"

nterms = int(input("How many terms? "))

# first two terms
n1, n2 = 0, 1
count = 0

p. And after the block of code another paragraph, with a footnote<1>.

bc. fn1. This is the footnote contents

p. And here is another paragraph

#[second_field]#
Some more contents

To parse the file I have the following parser.

parser.py

from lark import Lark

def read_file(filename):
    with open(filename) as f:
        return f.read()

grammar = read_file('grammar.lark')
parser = Lark(grammar, start='elements')
textile = read_file('test.textile')
tree = parser.parse(textile)
print(tree)

And the following grammar. This grammar does not parse bullet points and footnotes yet, because I have already run into another problem.

grammar.lark

elements: element+
?element: field content
?field: FIELD_START field_name FIELD_END
field_name: FIELD_NAME
content: contents*
?contents: paragraph
    | code_block
code_block: CODE_BLOCK_START STR
paragraph: PARAGRAPH_START? STR

FIELD_START: /(\A|[\r\n]{2,})#\[/
FIELD_NAME: /[^\]]+/
FIELD_END: /\]#[\r\n]/
CODE_BLOCK_START: /bc\.\.? /
PARAGRAPH_START: /p\. /
STR: /.+/s

When I run the parser, I get the following output.

Output

Tree(Token('RULE', 'elements'), [
    Tree(Token('RULE', 'element'), [
        Tree(Token('RULE', 'field'), [
            Token('FIELD_START', '#['), 
            Tree(Token('RULE', 'field_name'), [
                Token('FIELD_NAME', 'contents')]), 
            Token('FIELD_END', ']#\n')]), 
        Tree(Token('RULE', 'content'), [
            Tree(Token('RULE', 'paragraph'), [
                Token('PARAGRAPH_START', 'p. '), 
                Token('STR', 'This is a paragraph.\n\nWith some bullet points:\n\n* Bullet 1\n* Bullet 2\n* Bullet 3\n\nAnd a code block:\n\nbc.. # Program to display the Fibonacci sequence up to n-th term\n\n"""\\n* Program to display the Fibonacci sequence up to n-th term\n"""\n\nsearch_string = r"<5>"\n\nnterms = int(input("How many terms? "))\n\n# first two terms\nn1, n2 = 0, 1\ncount = 0\n\np. And after the block of code another paragraph, with a footnote<1>.\n\nbc. fn1. This is the footnote contents\n\np. And here is another paragraph\n\n#[second_field]#\nSome more contents\n')])])])])

The rest of the file is parsed as a single paragraph, which is to be expected, since /.+/s can match anything. So I changed the definition of STR to /.+?/s to make it non-greedy, but now the output is as follows (pretty-printed):

Output

elements
  element
    field
      #[
      field_name        contents
      ]#

    content
      paragraph
        p.
        T
      paragraph h
      paragraph i
      paragraph s
      paragraph
      paragraph i
      paragraph s
      paragraph
      paragraph a
      paragraph
--snip--
      paragraph

      paragraph #
      paragraph [
      paragraph s
      paragraph e
      paragraph c
      paragraph o
      paragraph n
      paragraph d
      paragraph _
      paragraph f
      paragraph i
      paragraph e
      paragraph l
      paragraph d
      paragraph ]
      paragraph #

Now it parses every single character as a paragraph, and it still puts the entire file into paragraph elements.
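The behaviour is easy to reproduce with plain re outside of Lark (a small sketch of my own, not part of the parser):

import re

text = "This is a paragraph.\n\nWith some bullet points:"

# Greedy /.+/s swallows all remaining input in a single match...
print(re.match(r".+", text, re.S).group(0))   # the entire string
# ...while non-greedy /.+?/s is already satisfied by one character,
# which is why every STR token above is a single character long.
print(re.match(r".+?", text, re.S).group(0))  # 'T'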

My first solution to this problem was to create a lexer that produces tokens for FIELD_START, FIELD_END, PARAGRAPH_START, CODE_BLOCK_START and the footnote-related markers.

My lexer looks as follows:

from lark import Lark
from lark.lexer import Lexer, Token
import re

class MyLexer(Lexer):
    def __init__(self, *args, **kwargs):
        pass

    def lex(self, data):
        tokens = {
            "FIELD_START": r"(?:\A|[\r\n]{2,})#\[",
            "FIELD_END": r"\]#[\r\n]",
            "FOOTNOTE_ANCHOR": r"<\d>",
            "FOOTNOTE_START": r"bc. fn\d. ",
            "PARAGRAPH_START": r"p\. ",
            "CODE_BLOCK_START": r"bc\.\.? ",
        }
        # Split the input on any of the known markers; the capturing groups
        # make re.split keep the markers themselves in the result.
        regex = '|'.join([f"({r})" for r in tokens.values()])
        for x in re.split(regex, data):
            if not x:
                continue
            # Emit a typed token for a recognised marker, and STR otherwise.
            for token_type, token_regex in tokens.items():
                if re.match(token_regex, x):
                    yield Token(token_type, x)
                    break
            else:
                yield Token("STR", x)

parser = Lark(grammar, lexer=MyLexer, start='elements')

It builds a single regex from the given token patterns, splits the whole input with it, and yields each piece as a token, either one of the defined token types or "STR".
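As a small illustration of the splitting trick (plain re, outside of Lark; only a sketch): because the alternatives are wrapped in capturing groups, re.split keeps the markers themselves, so both the markers and the text between them end up in the result.

import re

# Capturing groups make re.split keep the delimiters in the result list;
# the None entries come from groups that did not participate in a match.
parts = re.split(r"(p\. )|(<\d>)", "p. A footnote<1> here")
print(parts)
# ['', 'p. ', None, 'A footnote', None, '<1>', ' here']

The new grammar looks like this: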

elements: element+
?element: field content
?field: FIELD_START field_name FIELD_END
field_name: STR
content: contents*
?contents: STR 
    | paragraph 
    | code_block
code_block: CODE_BLOCK_START STR
paragraph: PARAGRAPH_START? paragraph_contents+
?paragraph_contents: STR 
    | FOOTNOTE_ANCHOR
    | footnote
footnote: FOOTNOTE_START STR

%declare FIELD_START FIELD_END FOOTNOTE_ANCHOR FOOTNOTE_START PARAGRAPH_START CODE_BLOCK_START STR

The output of the parser is as follows:

Tree(Token('RULE', 'elements'), [
    Tree(Token('RULE', 'element'), [
        Tree(Token('RULE', 'field'), [
            Token('FIELD_START', '#['), 
            Tree(Token('RULE', 'field_name'), [
                Token('STR', 'contents')]), 
            Token('FIELD_END', ']#\n')]), 
        Tree(Token('RULE', 'content'), [
            Tree(Token('RULE', 'paragraph'), [
                Token('PARAGRAPH_START', 'p. '), 
                Token('STR', 'This is a paragraph.\n\nWith some bullet points:\n\n* Bullet 1\n* Bullet 2\n* Bullet 3\n\nAnd a code block:\n\n')]), 
            Tree(Token('RULE', 'code_block'), [
                Token('CODE_BLOCK_START', 'bc.. '), 
                Token('STR', '# Program to display the Fibonacci sequence up to n-th term\n\n"""\\n* Program to display the Fibonacci sequence up to n-th term\n"""\n\nsearch_string = r"')]), 
            Tree(Token('RULE', 'paragraph'), [
                Token('FOOTNOTE_ANCHOR', '<5>'), 
                Token('STR', '"\n\nnterms = int(input("How many terms? "))\n\n# first two terms\nn1, n2 = 0, 1\ncount = 0\n\n')]), 
            Tree(Token('RULE', 'paragraph'), [
                Token('PARAGRAPH_START', 'p. '), 
                Token('STR', 'And after the block of code another paragraph, with a footnote'), 
                Token('FOOTNOTE_ANCHOR', '<1>'), 
                Token('STR', '.\n\n'), 
                Tree(Token('RULE', 'footnote'), [
                    Token('FOOTNOTE_START', 'bc. fn1. '), 
                    Token('STR', 'This is the footnote contents\n\n')])]), 
            Tree(Token('RULE', 'paragraph'), [
                Token('PARAGRAPH_START', 'p. '), 
                Token('STR', 'And here is another paragraph')])])]), 
    Tree(Token('RULE', 'element'), [
        Tree(Token('RULE', 'field'), [
            Token('FIELD_START', '\n\n#['), 
            Tree(Token('RULE', 'field_name'), [
                Token('STR', 'second_field')]), 
            Token('FIELD_END', ']#\n')]), 
        Tree(Token('RULE', 'content'), [
            Token('STR', 'Some more contents\n')])])])

This correctly parses the different fields and the footnote, but the code block is broken up by a detected FOOTNOTE_ANCHOR. Because the lexer has no notion of context, it also tries to tokenize footnote anchors inside code. The same problem occurs when trying to parse bullet points.
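The problem can be reproduced by feeding the lexer a single line from the code block (a sketch; the expected tokens below are what the patterns above produce):

# "<5>" inside the code block is tokenized as FOOTNOTE_ANCHOR, because the
# standalone lexer cannot know that it is inside a bc.. block.
for tok in MyLexer().lex('bc.. search_string = r"<5>"\n'):
    print(tok.type, repr(tok.value))
# CODE_BLOCK_START 'bc.. '
# STR 'search_string = r"'
# FOOTNOTE_ANCHOR '<5>'
# STR '"\n'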

What is the best solution to this problem? Do I really need a lexer? Is my lexer implemented correctly? (I could find very few examples of how to use a custom lexer on text.) Can I lex only some tokens and leave the rest to a "parent" lexer?

I have since found a solution.

The important thing is not to use /.+/s to match multiple lines, because then the parser never gets a chance to match any other token. It is better to match line by line, so the parser has a chance to match a new rule for each line. I also switched the parser to "lalr"; this grammar does not work with the standard parser.

parser.py

from lark import Lark

def read_file(filename):
    with open(filename) as f:
        return f.read()

grammar = read_file('grammar.lark')
parser = Lark(grammar, start='elements', parser="lalr")
textile = read_file('test.textile')
tree = parser.parse(textile)
print(tree.pretty())

grammar.lark

elements: element+
?element: field content
?field: NEWLINE* FIELD_START field_name FIELD_END NEWLINE
field_name: FIELD_NAME
content: contents*
?contents: paragraph
    | code_block
code_block: CODE_BLOCK_START (LINE NEWLINE)+
paragraph: PARAGRAPH_START? (paragraph_line | bullets | footnote)+
bullets: (BULLET paragraph_line)+
footnote: FOOTNOTE_START LINE NEWLINE
paragraph_line: (PARAGRAPH_LINE | FOOTNOTE_ANCHOR)+ NEWLINE

FIELD_START: "#["
FIELD_NAME: /[^\]]+/
FIELD_END: "]#"
FOOTNOTE_ANCHOR: /<\d>/
FOOTNOTE_START: /bc\. fn\d\. /
CODE_BLOCK_START: /bc\.\.? /
PARAGRAPH_START: /p\. /
LINE.-1: /.+/
BULLET.-2: "*"
PARAGRAPH_LINE.-3: /.+?(?=(<\d>|\r|\n))/

%import common.NEWLINE

Output:

elements
  element
    field
      #[
      field_name        contents
      ]#


    content
      paragraph
        p.
        paragraph_line
          This is a paragraph.



        paragraph_line
          With some bullet points:



        bullets
          *
          paragraph_line
             Bullet 1


          *
          paragraph_line
             Bullet 2


          *
          paragraph_line
             Bullet 3



        paragraph_line
          And a code block:



      code_block
        bc..
        # Program to display the Fibonacci sequence up to n-th term



        """\


        * Program to display the Fibonacci sequence up to n-th term


        """



        search_string = r"<5>"



        nterms = int(input("How many terms? "))



        # first two terms


        n1, n2 = 0, 1


        count = 0



      paragraph
        p.
        paragraph_line
          And after the block of code another paragraph, with a footnote
          <1>
          .



        footnote
          bc. fn1.
          This is the footnote contents



      paragraph
        p.
        paragraph_line
          And here is another paragraph



  element
    field
      #[
      field_name        second_field
      ]#


    content
      paragraph
        paragraph_line
          Some more contents

Note that the parser now also parses bullet points and footnotes correctly. To parse footnote anchors within a line, I made a special PARAGRAPH_LINE terminal that stops at the first footnote anchor it encounters, or at the end of the line. Also note that LINE has a higher priority than BULLET, so a bullet is not matched inside a code block (which looks for plain lines), only inside a paragraph.
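For completeness, the lookahead in PARAGRAPH_LINE can be checked on its own with plain re (a sketch outside of Lark):

import re

# PARAGRAPH_LINE stops either right before the first footnote anchor
# or at the end of the line, whichever comes first.
paragraph_line = re.compile(r".+?(?=(<\d>|\r|\n))")
line = "And after the block of code another paragraph, with a footnote<1>.\n"
print(paragraph_line.match(line).group(0))
# 'And after the block of code another paragraph, with a footnote'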