使用 CFG 解析枚举
Parsing enumerations with CFG
我有一个从富文本枚举生成的字符串,例如:
"(1) Let X denote one of the following: (a) weight (b) height (c)
depth (2) Y denote (a) color, except (i) white (ii) blue (b) pressure"
我要构建原始结构,例如:
{"Let X denote one of the following:" : {"weight":{}, "height":{}, "depth":{}} ,
"Y denote": {"color, except": {"white":{}, "blue":{}}}, "pressure":{} }
很明显这是上下文无关语法,但是我在实现它时遇到了麻烦pyparsing
编辑
我不是 CFG 专家,所以我希望这个 BNF 表示是正确的:
假设如下:
w
等同于任何字符 (re.compile("\w*")
)
l
等同于 re.compile("[a-z]")
d
等同于`re.compile("\d+")
r
相当于罗马数字 (i
,ii
,iii
,...)
然后(希望)BNF 应该看起来像这样
<E1>::= "(" <d> ")" | <E1> " "
<E2>::= "(" <l> ")" | <E2> " "
<E3>::= "(" <r> ")" | <E3> " "
<L0>::= <w> | <w> <E1> <L1> <L0>
<L1>::= <w> | <w> <E2> <L2> <L1>
<L2>::= <w> | <w> <E3> <L2>
这是使用 pyparsing 表达式的解析器的第一部分:
import pyparsing as pp
LPAR, RPAR = map(pp.Suppress, "()")
COMMA, COLON = map(pp.Literal, ",:")
wd = pp.Word(pp.alphas)
letter = pp.oneOf(list(pp.alphas.lower()))
integer = pp.pyparsing_common.integer
roman = pp.Word('ivx')
e1 = LPAR + integer + RPAR
e2 = LPAR + letter + RPAR
e3 = LPAR + roman + RPAR
下一部分,基于您的 BNF,可能如下所示:
# predefine levels using Forwards, since they are recursive
lev0 = pp.Forward()
lev1 = pp.Forward()
lev2 = pp.Forward()
lev0 <<= wd | wd + e1 + lev1 + lev0
lev1 <<= wd | wd + e2 + lev2 + lev1
lev2 <<= wd | wd + e3 + lev2
我假设 lev0
应该将您的测试字符串解析为第 0 层嵌套。
正如我在对您的问题的评论中提到的,这会立即失败,因为您的测试字符串以“(1)”开头,但您的级别不以任何 e
表达式开头。
在继续之前,您的 BNF 以 classic BNF 形式实现重复:
e ::= some expression
list_of_e ::= e (list_of_e | empty)
在 pyparsing 中,您可以更直接地实现这一点:
wd = pp.Word(pp.alphas)
list_of_wd = pp.OneOrMore(wd)
# or using tuple multiplication short-hand
list_of_wd = wd * (1,)
查看您的示例,我将您的 BNF 级别重写为:
wds = pp.Group(wd*(1,))
lev0 <<= e1 + wds + lev1*(0,)
lev1 <<= e2 + wds + lev2*(0,)
lev2 <<= e3 + wds
expr = lev0()*(1,)
expr.ignore(COMMA | COLON)
(我没有看到逗号或冒号有助于解析,所以我忽略了它们。)
使用 expr
解析您的字符串:
tests = """\
(1) Y denote (a) color (b) pressure
(1) Let X denote one of the following: (a) weight (b) height (c) depth (2) Y denote (a) color, except (i) white (ii) blue (b) pressure
"""
for test in tests.splitlines():
print(test)
expr.parseString(test).pprint()
print()
我们得到:
(1) Y denote (a) color (b) pressure
[1, ['Y', 'denote'], 'a', ['color'], 'b', ['pressure']]
(1) Let X denote one of the following: (a) weight (b) height (c) depth (2) Y denote (a) color, except (i) white (ii) blue (b) pressure
[1,
['Let', 'X', 'denote', 'one', 'of', 'the', 'following'],
'a',
['weight'],
'b',
['height'],
'c',
['depth'],
2,
['Y', 'denote'],
'a',
['color', 'except'],
'i',
['white'],
'ii',
['blue'],
'b',
['pressure']]
所以它解析了,从某种意义上说,它通过了整个输入字符串,但我们所做的只是基本的标记化,并没有表示您的 integer/alpha/roman 嵌套列表所隐含的任何结构。
Pyparsing 包含一个分组 class 来构造结果:
G = pp.Group
wds = G(wd*(1,))
lev0 <<= G(e1 + G(wds + lev1*(0,)))
lev1 <<= G(e2 + G(wds + lev2*(0,)))
lev2 <<= G(e3 + wds)
expr = lev0()*(1,)
expr.ignore(COMMA | COLON)
这提供了更好地保留层次结构的输出:
(1) Y denote (a) color (b) pressure
[[1, [['Y', 'denote'], ['a', [['color']]], ['b', [['pressure']]]]]]
(1) Let X denote one of the following: (a) weight (b) height (c) depth (2) Y denote (a) color, except (i) white (ii) blue (b) pressure
[[1,
[['Let', 'X', 'denote', 'one', 'of', 'the', 'following'],
['a', [['weight']]],
['b', [['height']]],
['c', [['depth']]]]],
[2,
[['Y', 'denote'],
['a', [['color', 'except'], ['i', ['white']], ['ii', ['blue']]]],
['b', [['pressure']]]]]]
一个完整的解析器实际上会理解 "one of the following" 与 "all of the following" 的概念,以及元素的包含和排除,但这超出了这个问题的范围。
我有一个从富文本枚举生成的字符串,例如:
"(1) Let X denote one of the following: (a) weight (b) height (c) depth (2) Y denote (a) color, except (i) white (ii) blue (b) pressure"
我要构建原始结构,例如:
{"Let X denote one of the following:" : {"weight":{}, "height":{}, "depth":{}} ,
"Y denote": {"color, except": {"white":{}, "blue":{}}}, "pressure":{} }
很明显这是上下文无关语法,但是我在实现它时遇到了麻烦pyparsing
编辑
我不是 CFG 专家,所以我希望这个 BNF 表示是正确的:
假设如下:
w
等同于任何字符 (re.compile("\w*")
)l
等同于re.compile("[a-z]")
d
等同于`re.compile("\d+")r
相当于罗马数字 (i
,ii
,iii
,...)
然后(希望)BNF 应该看起来像这样
<E1>::= "(" <d> ")" | <E1> " "
<E2>::= "(" <l> ")" | <E2> " "
<E3>::= "(" <r> ")" | <E3> " "
<L0>::= <w> | <w> <E1> <L1> <L0>
<L1>::= <w> | <w> <E2> <L2> <L1>
<L2>::= <w> | <w> <E3> <L2>
这是使用 pyparsing 表达式的解析器的第一部分:
import pyparsing as pp
LPAR, RPAR = map(pp.Suppress, "()")
COMMA, COLON = map(pp.Literal, ",:")
wd = pp.Word(pp.alphas)
letter = pp.oneOf(list(pp.alphas.lower()))
integer = pp.pyparsing_common.integer
roman = pp.Word('ivx')
e1 = LPAR + integer + RPAR
e2 = LPAR + letter + RPAR
e3 = LPAR + roman + RPAR
下一部分,基于您的 BNF,可能如下所示:
# predefine levels using Forwards, since they are recursive
lev0 = pp.Forward()
lev1 = pp.Forward()
lev2 = pp.Forward()
lev0 <<= wd | wd + e1 + lev1 + lev0
lev1 <<= wd | wd + e2 + lev2 + lev1
lev2 <<= wd | wd + e3 + lev2
我假设 lev0
应该将您的测试字符串解析为第 0 层嵌套。
正如我在对您的问题的评论中提到的,这会立即失败,因为您的测试字符串以“(1)”开头,但您的级别不以任何 e
表达式开头。
在继续之前,您的 BNF 以 classic BNF 形式实现重复:
e ::= some expression
list_of_e ::= e (list_of_e | empty)
在 pyparsing 中,您可以更直接地实现这一点:
wd = pp.Word(pp.alphas)
list_of_wd = pp.OneOrMore(wd)
# or using tuple multiplication short-hand
list_of_wd = wd * (1,)
查看您的示例,我将您的 BNF 级别重写为:
wds = pp.Group(wd*(1,))
lev0 <<= e1 + wds + lev1*(0,)
lev1 <<= e2 + wds + lev2*(0,)
lev2 <<= e3 + wds
expr = lev0()*(1,)
expr.ignore(COMMA | COLON)
(我没有看到逗号或冒号有助于解析,所以我忽略了它们。)
使用 expr
解析您的字符串:
tests = """\
(1) Y denote (a) color (b) pressure
(1) Let X denote one of the following: (a) weight (b) height (c) depth (2) Y denote (a) color, except (i) white (ii) blue (b) pressure
"""
for test in tests.splitlines():
print(test)
expr.parseString(test).pprint()
print()
我们得到:
(1) Y denote (a) color (b) pressure
[1, ['Y', 'denote'], 'a', ['color'], 'b', ['pressure']]
(1) Let X denote one of the following: (a) weight (b) height (c) depth (2) Y denote (a) color, except (i) white (ii) blue (b) pressure
[1,
['Let', 'X', 'denote', 'one', 'of', 'the', 'following'],
'a',
['weight'],
'b',
['height'],
'c',
['depth'],
2,
['Y', 'denote'],
'a',
['color', 'except'],
'i',
['white'],
'ii',
['blue'],
'b',
['pressure']]
所以它解析了,从某种意义上说,它通过了整个输入字符串,但我们所做的只是基本的标记化,并没有表示您的 integer/alpha/roman 嵌套列表所隐含的任何结构。
Pyparsing 包含一个分组 class 来构造结果:
G = pp.Group
wds = G(wd*(1,))
lev0 <<= G(e1 + G(wds + lev1*(0,)))
lev1 <<= G(e2 + G(wds + lev2*(0,)))
lev2 <<= G(e3 + wds)
expr = lev0()*(1,)
expr.ignore(COMMA | COLON)
这提供了更好地保留层次结构的输出:
(1) Y denote (a) color (b) pressure
[[1, [['Y', 'denote'], ['a', [['color']]], ['b', [['pressure']]]]]]
(1) Let X denote one of the following: (a) weight (b) height (c) depth (2) Y denote (a) color, except (i) white (ii) blue (b) pressure
[[1,
[['Let', 'X', 'denote', 'one', 'of', 'the', 'following'],
['a', [['weight']]],
['b', [['height']]],
['c', [['depth']]]]],
[2,
[['Y', 'denote'],
['a', [['color', 'except'], ['i', ['white']], ['ii', ['blue']]]],
['b', [['pressure']]]]]]
一个完整的解析器实际上会理解 "one of the following" 与 "all of the following" 的概念,以及元素的包含和排除,但这超出了这个问题的范围。