Parsing elements/fields without end marker, problems with non-greedy regex, usage of a custom lexer
I want to parse files written in the Textile markup language (https://textile-lang.com/) so I can convert them to LaTeX. The files I have are a slight extension of Textile, in that they add fields and footnotes. An example file is given below.
test.textile
#[contents]#

p. This is a paragraph.

With some bullet points:

* Bullet 1
* Bullet 2
* Bullet 3

And a code block:

bc.. # Program to display the Fibonacci sequence up to n-th term

"""\
* Program to display the Fibonacci sequence up to n-th term
"""

search_string = r"<5>"

nterms = int(input("How many terms? "))

# first two terms
n1, n2 = 0, 1
count = 0

p. And after the block of code another paragraph, with a footnote<1>.

bc. fn1. This is the footnote contents

p. And here is another paragraph

#[second_field]#
Some more contents
To parse the file, I have the following parser.
parser.py
from lark import Lark

def read_file(filename):
    with open(filename) as f:
        return f.read()

grammar = read_file('grammar.lark')
parser = Lark(grammar, start='elements')
textile = read_file('test.textile')
tree = parser.parse(textile)
print(tree)
And the following grammar. This grammar does not parse bullet points and footnotes yet, because I already ran into another problem.
grammar.lark
elements: element+
?element: field content
?field: FIELD_START field_name FIELD_END
field_name: FIELD_NAME
content: contents*
?contents: paragraph
| code_block
code_block: CODE_BLOCK_START STR
paragraph: PARAGRAPH_START? STR
FIELD_START: /(\A|[\r\n]{2,})#\[/
FIELD_NAME: /[^\]]+/
FIELD_END: /\]#[\r\n]/
CODE_BLOCK_START: /bc\.\.? /
PARAGRAPH_START: /p\. /
STR: /.+/s
When I run the parser, I get the following output.
Output
Tree(Token('RULE', 'elements'), [
Tree(Token('RULE', 'element'), [
Tree(Token('RULE', 'field'), [
Token('FIELD_START', '#['),
Tree(Token('RULE', 'field_name'), [
Token('FIELD_NAME', 'contents')]),
Token('FIELD_END', ']#\n')]),
Tree(Token('RULE', 'content'), [
Tree(Token('RULE', 'paragraph'), [
Token('PARAGRAPH_START', 'p. '),
Token('STR', 'This is a paragraph.\n\nWith some bullet points:\n\n* Bullet 1\n* Bullet 2\n* Bullet 3\n\nAnd a code block:\n\nbc.. # Program to display the Fibonacci sequence up to n-th term\n\n"""\\n* Program to display the Fibonacci sequence up to n-th term\n"""\n\nsearch_string = r"<5>"\n\nnterms = int(input("How many terms? "))\n\n# first two terms\nn1, n2 = 0, 1\ncount = 0\n\np. And after the block of code another paragraph, with a footnote<1>.\n\nbc. fn1. This is the footnote contents\n\np. And here is another paragraph\n\n#[second_field]#\nSome more contents\n')])])])])
The rest of the file is parsed as a single paragraph, which is correct, since /.+/s can match anything. So I changed the definition of STR to /.+?/s to make it non-greedy, but now the output is as follows (pretty-printed):
Output
elements
element
field
#[
field_name contents
]#
content
paragraph
p.
T
paragraph h
paragraph i
paragraph s
paragraph
paragraph i
paragraph s
paragraph
paragraph a
paragraph
--snip--
paragraph
paragraph #
paragraph [
paragraph s
paragraph e
paragraph c
paragraph o
paragraph n
paragraph d
paragraph _
paragraph f
paragraph i
paragraph e
paragraph l
paragraph d
paragraph ]
paragraph #
It parses each individual character as a paragraph, and it still parses the entire file into paragraph elements.
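This behaviour can be reproduced with Python's re module alone: a non-greedy quantifier matches as little as possible, which for /.+?/ is a single character, so the parser closes a paragraph after every character. A minimal sketch (plain re, no Lark involved):

import re

text = "This is a paragraph."

# Greedy: consumes everything up to the end of the string.
print(re.match(r".+", text, re.S).group())   # -> 'This is a paragraph.'

# Non-greedy: matches as little as possible, i.e. one character,
# which is why the parser emits a paragraph per character.
print(re.match(r".+?", text, re.S).group())  # -> 'T'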
My first solution to this problem was to create a lexer that produces tokens for FIELD_START, FIELD_END, PARAGRAPH_START, CODE_BLOCK_START, and the footnote-related terminals.
My lexer looks like this:
from lark.lexer import Lexer, Token
import re

class MyLexer(Lexer):
    def __init__(self, *args, **kwargs):
        pass

    def lex(self, data):
        # Terminal regexes, checked in order against every fragment.
        tokens = {
            "FIELD_START": r"(?:\A|[\r\n]{2,})#\[",
            "FIELD_END": r"\]#[\r\n]",
            "FOOTNOTE_ANCHOR": r"<\d>",
            "FOOTNOTE_START": r"bc\. fn\d\. ",
            "PARAGRAPH_START": r"p\. ",
            "CODE_BLOCK_START": r"bc\.\.? ",
        }
        # Capturing groups make re.split keep the delimiters; groups that
        # did not participate in a match show up as None.
        regex = '|'.join(f"({r})" for r in tokens.values())
        for x in re.split(regex, data):
            if not x:  # skip None and empty fragments
                continue
            for token_type, token_regex in tokens.items():
                if re.match(token_regex, x):
                    yield Token(token_type, x)
                    break
            else:
                yield Token("STR", x)

parser = Lark(grammar, lexer=MyLexer, start='elements')
It builds a single regex from the given token definitions, splits the whole input on that regex, and yields each fragment as a token, either one of the defined token types or STR.
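The splitting relies on a documented property of re.split: when the pattern contains capturing groups, the matched delimiters are kept in the result list, and groups that did not participate appear as None, which is why the lexer skips falsy fragments. A minimal sketch with two of the token regexes:

import re

# Capturing groups make re.split keep the matched delimiters
# in the result, with None for groups that did not match.
parts = re.split(r"(p\. )|(bc\.\.? )", "p. A paragraph\n\nbc.. some code\n")
print(parts)
# -> ['', 'p. ', None, 'A paragraph\n\n', None, 'bc.. ', 'some code\n']

The new grammar looks like this: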
elements: element+
?element: field content
?field: FIELD_START field_name FIELD_END
field_name: STR
content: contents*
?contents: STR
| paragraph
| code_block
code_block: CODE_BLOCK_START STR
paragraph: PARAGRAPH_START? paragraph_contents+
?paragraph_contents: STR
| FOOTNOTE_ANCHOR
| footnote
footnote: FOOTNOTE_START STR
%declare FIELD_START FIELD_END FOOTNOTE_ANCHOR FOOTNOTE_START PARAGRAPH_START CODE_BLOCK_START STR
The output of the parser is as follows:
Tree(Token('RULE', 'elements'), [
Tree(Token('RULE', 'element'), [
Tree(Token('RULE', 'field'), [
Token('FIELD_START', '#['),
Tree(Token('RULE', 'field_name'), [
Token('STR', 'contents')]),
Token('FIELD_END', ']#\n')]),
Tree(Token('RULE', 'content'), [
Tree(Token('RULE', 'paragraph'), [
Token('PARAGRAPH_START', 'p. '),
Token('STR', 'This is a paragraph.\n\nWith some bullet points:\n\n* Bullet 1\n* Bullet 2\n* Bullet 3\n\nAnd a code block:\n\n')]),
Tree(Token('RULE', 'code_block'), [
Token('CODE_BLOCK_START', 'bc.. '),
Token('STR', '# Program to display the Fibonacci sequence up to n-th term\n\n"""\\n* Program to display the Fibonacci sequence up to n-th term\n"""\n\nsearch_string = r"')]),
Tree(Token('RULE', 'paragraph'), [
Token('FOOTNOTE_ANCHOR', '<5>'),
Token('STR', '"\n\nnterms = int(input("How many terms? "))\n\n# first two terms\nn1, n2 = 0, 1\ncount = 0\n\n')]),
Tree(Token('RULE', 'paragraph'), [
Token('PARAGRAPH_START', 'p. '),
Token('STR', 'And after the block of code another paragraph, with a footnote'),
Token('FOOTNOTE_ANCHOR', '<1>'),
Token('STR', '.\n\n'),
Tree(Token('RULE', 'footnote'), [
Token('FOOTNOTE_START', 'bc. fn1. '),
Token('STR', 'This is the footnote contents\n\n')])]),
Tree(Token('RULE', 'paragraph'), [
Token('PARAGRAPH_START', 'p. '),
Token('STR', 'And here is another paragraph')])])]),
Tree(Token('RULE', 'element'), [
Tree(Token('RULE', 'field'), [
Token('FIELD_START', '\n\n#['),
Tree(Token('RULE', 'field_name'), [
Token('STR', 'second_field')]),
Token('FIELD_END', ']#\n')]),
Tree(Token('RULE', 'content'), [
Token('STR', 'Some more contents\n')])])])
This correctly parses the different fields and the footnotes; however, the code block is broken up by a detected FOOTNOTE_ANCHOR. Because the lexer knows nothing about context, it tries to match footnote anchors inside code as well. The same problem will occur when trying to parse bullet points.
What is the best solution to this problem? Do I really need a lexer? Is my lexer implemented correctly? (I could find very few examples of how to use a custom lexer on text.) Can I lex only some of the tokens and leave the rest to a "parent" lexer?
Based on this I found a solution.
The important thing is not to match multiple lines with /.+/s, because that way the parser never gets a chance to match the other tokens. It is better to match line by line, so that the parser can try a new rule for every line. I also switched the parser to "lalr"; the grammar does not work with the standard (Earley) parser.
parser.py
from lark import Lark

def read_file(filename):
    with open(filename) as f:
        return f.read()

grammar = read_file('grammar.lark')
parser = Lark(grammar, start='elements', parser="lalr")
textile = read_file('test.textile')
tree = parser.parse(textile)
print(tree.pretty())
grammar.lark
elements: element+
?element: field content
?field: NEWLINE* FIELD_START field_name FIELD_END NEWLINE
field_name: FIELD_NAME
content: contents*
?contents: paragraph
| code_block
code_block: CODE_BLOCK_START (LINE NEWLINE)+
paragraph: PARAGRAPH_START? (paragraph_line | bullets | footnote)+
bullets: (BULLET paragraph_line)+
footnote: FOOTNOTE_START LINE NEWLINE
paragraph_line: (PARAGRAPH_LINE | FOOTNOTE_ANCHOR)+ NEWLINE
FIELD_START: "#["
FIELD_NAME: /[^\]]+/
FIELD_END: "]#"
FOOTNOTE_ANCHOR: /<\d>/
FOOTNOTE_START: /bc\. fn\d\. /
CODE_BLOCK_START: /bc\.\.? /
PARAGRAPH_START: /p\. /
LINE.-1: /.+/
BULLET.-2: "*"
PARAGRAPH_LINE.-3: /.+?(?=(<\d>|\r|\n))/
%import common.NEWLINE
Output:
elements
element
field
#[
field_name contents
]#
content
paragraph
p.
paragraph_line
This is a paragraph.
paragraph_line
With some bullet points:
bullets
*
paragraph_line
Bullet 1
*
paragraph_line
Bullet 2
*
paragraph_line
Bullet 3
paragraph_line
And a code block:
code_block
bc..
# Program to display the Fibonacci sequence up to n-th term
"""\
* Program to display the Fibonacci sequence up to n-th term
"""
search_string = r"<5>"
nterms = int(input("How many terms? "))
# first two terms
n1, n2 = 0, 1
count = 0
paragraph
p.
paragraph_line
And after the block of code another paragraph, with a footnote
<1>
.
footnote
bc. fn1.
This is the footnote contents
paragraph
p.
paragraph_line
And here is another paragraph
element
field
#[
field_name second_field
]#
content
paragraph
paragraph_line
Some more contents
Note that the parser now also parses the bullet points and footnotes correctly. To parse footnote anchors inside a line, I made a special PARAGRAPH_LINE terminal that stops at the first footnote anchor it encounters, or at the end of the line. Note also that plain lines take priority over bullets, so bullets are not matched inside a code block (which looks for plain lines), only inside paragraphs.
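The lookahead in PARAGRAPH_LINE can be checked in isolation with Python's re module; a minimal sketch:

import re

PARAGRAPH_LINE = re.compile(r".+?(?=(<\d>|\r|\n))")

line = "And after the block of code another paragraph, with a footnote<1>.\n"

# The non-greedy body stops at the first position where the lookahead
# succeeds: just before a footnote anchor or the end of the line.
print(PARAGRAPH_LINE.match(line).group())
# -> 'And after the block of code another paragraph, with a footnote'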