Does Group() in pyparsing require a post-processing step to generate a structure specific to the language being parsed?

This is an extension of this question. I wrote the pyparsing code as a one-to-one translation of the grammar.

My DSL:

response:success
response:success AND extension:php OR extension:css
response:sucess AND (extension:php OR extension:css)
time >= 2020-01-09
time >= 2020-01-09 AND response:success OR os:windows
NOT reponse:success
response:success AND NOT os:windows

The EBNF grammar for the DSL above:

<expr> ::= <or>
<or> ::= <and> (" OR " <and>)*
<and> ::= <unary> ((" AND ") <unary>)*
<unary> ::= " NOT " <unary> | <equality>
<equality> ::=  (<word> ":" <word>) | <comparison>
<comparison> ::= "(" <expr> ")" | (<word> (" > " | " >= " | " < " | " <= ") <word>)+
<word> ::= ("a" | "b" | "c" | "d" | "e" | "f" | "g"
                      | "h" | "i" | "j" | "k" | "l" | "m" | "n"
                      | "o" | "p" | "q" | "r" | "s" | "t" | "u"
                      | "v" | "w" | "x" | "y" | "z")+

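I can now get a list of tokens. Is the next step to generate some kind of AST/structure so that I can generate code from each node type?

After reading some of the pyparsing examples, I think I have a rough idea of how to approach this:

1) I can use Group() to group together the related structures that matter for code generation; each group would probably represent one node in the AST.
2) Together with Group(), I can use setParseAction() to encode the Python representation of my node objects directly during the parsing phase, instead of generating an intermediate structure first.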

My approach in code:
from pyparsing import (Forward, Group, Keyword, Literal, OneOrMore, Word,
                       ZeroOrMore, alphanums)

AND = Keyword('AND')
OR  = Keyword('OR')
NOT = Keyword('NOT')
word = Word(alphanums+'_')

expr = Forward()
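# '|' mirrors the alternation in the EBNF; '>=' and '<=' are tried before '>' and '<' so the longer operators can match
Comparison = (Literal('(') + expr + Literal(')')) | OneOrMore(word + (Literal('>=') | Literal('>') | Literal('<=') | Literal('<')) + word)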
Equality = Group((word('searchKey') + Literal(':') + word('searchValue')) | Comparison)
Unary = Forward()
unaryNot = NOT + Unary
Unary << (unaryNot | Equality)
And = Group(Unary + ZeroOrMore(AND + Unary))
Or = And + ZeroOrMore(OR + And)

expr << Or



class AndNode:
    def __init__(self, tokens):
        self.tokens = tokens.asList()

    def query(self):
        pass #generate the relevant elastic search query here?


class ExactMatchNode:
    def __init__(self, tokens):
        self.tokens = tokens

    def __repr__(self):
        return "<ExactMatchNode>"
    def query(self):
        pass #generate the relevant elasticsearch query here?


Equality.setParseAction(ExactMatchNode)




Q1 = '''response:200 AND time:22 AND rex:32 OR NOT demo:good'''
result = expr.parseString(Q1)

print(result.dump())

This is my output:

[[<ExactMatchNode>, 'AND', <ExactMatchNode>, 'AND', <ExactMatchNode>], 'OR', ['NOT', <ExactMatchNode>]]
[0]:
  [<ExactMatchNode>, 'AND', <ExactMatchNode>, 'AND', <ExactMatchNode>]
[1]:
  OR
[2]:
  ['NOT', <ExactMatchNode>]

At this point I am lost: how does this represent a tree structure? For example

[<ExactMatchNode>, 'AND', <ExactMatchNode>, 'AND', <ExactMatchNode>]

should be something like this, shouldn't it?

[AND [<ExactMatchNode>, <ExactMatchNode>,  <ExactMatchNode>]]

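I suppose this could be done in setParseAction, but I am not sure whether that is the right direction, or whether I should start modifying my grammar at this point. The end goal of this DSL is to translate a given query into the Elasticsearch JSON query language.

Edit: after trying a few things, this is what I have: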

class NotNode:
    def __init__(self, tokens):
        self.negatearg = tokens
        #print(f'**** \n {self.negatearg} \n +++')

    def __repr__(self):
        return f'( NOT-> {self.negatearg} )'

class AndNode:
    def __init__(self, tokens):
        self.conds = tokens[0][::2]
        #print(f'**** \n {tokens} \n +++')

    def __repr__(self):
        return f'( AND-> {self.conds} )'

    def generate_query(self):
        result = [cond.generate_query() for cond in self.conds]
        return result


class ExactMatchNode:
    def __init__(self, tokens):
        self.tokens = tokens[0]
        #print(f'**** \n {tokens} \n +++')

    def __repr__(self):
        return f"<ExactMatchNode {self.tokens.searchKey}={self.tokens.searchValue}>"

    def generate_query(self):
        return {
                'term' : { self.tokens[0]: self.tokens[2]}
        }


unaryNot.setParseAction(NotNode)
Equality.setParseAction(ExactMatchNode)
And.setParseAction(AndNode)

I can now generate a query with <some node object>.generate_query().

But I noticed one odd thing in the output below:

[( AND-> [<ExactMatchNode response=200>, <ExactMatchNode time=22>, <ExactMatchNode rex=32>] ), 'OR', ( AND-> [( NOT-> ['NOT', <ExactMatchNode demo=good>] )] )] 

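A second AND-> node is wrapped around the NOT node, even though that branch contains no AND.

My question remains the same: is this the right way to use pyparsing, or am I missing something obvious and heading in the wrong direction?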

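Attaching node classes with setParseAction is the best way I have found to build an AST from a hierarchical grammar. If you use this method, you probably don't need the Group construct. The reason you get the second And is that your parser always produces an AndNode, even when there is only a single operand with no additional AND operand.

You can expand your And expression so that the AndNode parse action class is attached only when operand AND operand is actually present (and likewise for NOT and OR), something like: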

And = (Unary + OneOrMore(AND + Unary)).addParseAction(AndNode) | Unary
Or = (And + OneOrMore(OR + And)).addParseAction(OrNode) | And
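
A minimal, self-contained sketch of the expressions above, under a few assumptions of mine: `term` is a simplified stand-in for your Equality expression, and because no Group is used here the parse action receives a flat token list, so the node classes slice `tokens[::2]` rather than `tokens[0][::2]`.

```python
import pyparsing as pp

AND, OR = map(pp.Keyword, "AND OR".split())

# Simplified stand-in for the question's Equality expression (assumption).
term = pp.Regex(r"\w+:\w+")


class AndNode:
    def __init__(self, tokens):
        # No Group here, so tokens is flat: [operand, 'AND', operand, ...]
        self.conds = list(tokens[::2])

    def __repr__(self):
        return f"( AND-> {self.conds} )"


class OrNode:
    def __init__(self, tokens):
        self.conds = list(tokens[::2])

    def __repr__(self):
        return f"( OR-> {self.conds} )"


# The parse action is attached only to the alternative that contains at
# least one operator; a lone operand falls through to the second alternative.
and_expr = (term + pp.OneOrMore(AND + term)).addParseAction(AndNode) | term
or_expr = (and_expr + pp.OneOrMore(OR + and_expr)).addParseAction(OrNode) | and_expr

single = or_expr.parseString("a:1", parseAll=True)
combined = or_expr.parseString("a:1 AND b:2 OR c:3", parseAll=True)
print(single)    # the bare operand, with no AndNode or OrNode wrapper
print(combined)  # one OrNode whose first condition is an AndNode
```

With this shape, a single operand passes through untouched, so no empty AND or OR wrapper appears around it.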

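This is how pyparsing's infixNotation handles these kinds of operators.

My version of your parser, using infixNotation (I think the classes are almost all the same; I may have tweaked the NotNode definition):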
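
To see that behaviour in isolation, here is a small sketch with a single AND level and no parse actions (the one-operator grammar is only an illustration, not your DSL). A lone operand comes back ungrouped, while an operand followed by operators comes back as one nested group:

```python
import pyparsing as pp

AND = pp.Keyword("AND")
operand = ~AND + pp.Word(pp.alphas)

# One precedence level: a binary, left-associative AND.
expr = pp.infixNotation(operand, [(AND, 2, pp.opAssoc.LEFT)])

lone = expr.parseString("a", parseAll=True).asList()
chained = expr.parseString("a AND b AND c", parseAll=True).asList()
print(lone)     # ['a']
print(chained)  # [['a', 'AND', 'b', 'AND', 'c']]
```

That nested group is why the node classes attached through infixNotation index `tokens[0]` before slicing.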

"""
<expr> ::= <or>
<or> ::= <and> (" OR " <and>)*
<and> ::= <unary> ((" AND ") <unary>)*
<unary> ::= " NOT " <unary> | <equality>
<equality> ::=  (<word> ":" <word>) | <comparison>
<comparison> ::= "(" <expr> ")" | (<word> (" > " | " >= " | " < " | " <= ") <word>)+
<word> ::= ("a" | "b" | "c" | "d" | "e" | "f" | "g"
                      | "h" | "i" | "j" | "k" | "l" | "m" | "n"
                      | "o" | "p" | "q" | "r" | "s" | "t" | "u"
                      | "v" | "w" | "x" | "y" | "z")+
"""

import pyparsing as pp

NOT, AND, OR = map(pp.Keyword, "NOT AND OR".split())

word = ~(NOT | AND | OR) + pp.Word(pp.alphas.lower() + '-_')
date = pp.Regex(r"\d{4}-\d{2}-\d{2}")
operand = word | date

class ExactMatchNode:
    def __init__(self, tokens):
        self.tokens = tokens

    def __repr__(self):
        return "<ExactMatchNode>"
    def query(self):
        pass #generate the relevant elasticsearch query here?

class ComparisonNode:
    def __init__(self, tokens):
        self.tokens = tokens

    def __repr__(self):
        return "<ComparisonNode>"
    def query(self):
        pass #generate the relevant elasticsearch query here?

class NotNode:
    def __init__(self, tokens):
        self.negatearg = tokens[0][1]
        #print(f'**** \n {self.negatearg} \n +++')

    def __repr__(self):
        return f'( NOT-> {self.negatearg} )'

class AndNode:
    def __init__(self, tokens):
        self.conds = tokens[0][::2]
        #print(f'**** \n {tokens} \n +++')

    def __repr__(self):
        return f'( AND-> {self.conds} )'

    def generate_query(self):
        result = [cond.generate_query() for cond in self.conds]
        return result

class OrNode:
    def __init__(self, tokens):
        self.conds = tokens[0][::2]
        #print(f'**** \n {tokens} \n +++')

    def __repr__(self):
        return f'( OR-> {self.conds} )'

    def generate_query(self):
        result = [cond.generate_query() for cond in self.conds]
        return result

expr = pp.infixNotation(operand,
    [
    (':', 2, pp.opAssoc.LEFT, ExactMatchNode),
    (pp.oneOf("> >= < <="), 2, pp.opAssoc.LEFT, ComparisonNode),
    (NOT, 1, pp.opAssoc.RIGHT, NotNode),
    (AND, 2, pp.opAssoc.LEFT, AndNode),
    (OR, 2, pp.opAssoc.LEFT, OrNode),
    ])


expr.runTests("""\
    response:success
    response:success AND extension:php OR extension:css
    response:sucess AND (extension:php OR extension:css)
    time >= 2020-01-09
    time >= 2020-01-09 AND response:success OR os:windows
    NOT reponse:success
    response:success AND NOT os:windows
    """)

Prints:

response:success
[<ExactMatchNode>]

response:success AND extension:php OR extension:css
[( OR-> [( AND-> [<ExactMatchNode>, <ExactMatchNode>] ), <ExactMatchNode>] )]

response:sucess AND (extension:php OR extension:css)
[( AND-> [<ExactMatchNode>, ( OR-> [<ExactMatchNode>, <ExactMatchNode>] )] )]

time >= 2020-01-09
[<ComparisonNode>]

time >= 2020-01-09 AND response:success OR os:windows
[( OR-> [( AND-> [<ComparisonNode>, <ExactMatchNode>] ), <ExactMatchNode>] )]

NOT reponse:success
[( NOT-> <ExactMatchNode> )]

response:success AND NOT os:windows
[( AND-> [<ExactMatchNode>, ( NOT-> <ExactMatchNode> )] )]
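
Since the end goal is an Elasticsearch query, the query() stubs could be filled in along the following lines. This is only a sketch under my own assumptions: AND, OR and NOT are mapped to the `must`, `should` and `must_not` clauses of a `bool` query, key:value becomes a `term` query, and comparisons become a `range` query. `RANGE_OPS` is a hypothetical helper, and whether `term` is the right query type depends on your index mapping.

```python
import json
import pyparsing as pp

NOT, AND, OR = map(pp.Keyword, "NOT AND OR".split())
word = ~(NOT | AND | OR) + pp.Word(pp.alphas.lower() + "-_")
date = pp.Regex(r"\d{4}-\d{2}-\d{2}")
operand = word | date

# Hypothetical helper: DSL comparison operator -> Elasticsearch range key.
RANGE_OPS = {">": "gt", ">=": "gte", "<": "lt", "<=": "lte"}


class ExactMatchNode:
    def __init__(self, tokens):
        group = tokens[0]  # [key, ':', value]
        self.key, self.value = group[0], group[2]

    def generate_query(self):
        return {"term": {self.key: self.value}}


class ComparisonNode:
    def __init__(self, tokens):
        group = tokens[0]  # [field, operator, value]
        self.field, self.op, self.value = group[0], group[1], group[2]

    def generate_query(self):
        return {"range": {self.field: {RANGE_OPS[self.op]: self.value}}}


class NotNode:
    def __init__(self, tokens):
        self.negatearg = tokens[0][1]  # ['NOT', operand]

    def generate_query(self):
        return {"bool": {"must_not": [self.negatearg.generate_query()]}}


class AndNode:
    def __init__(self, tokens):
        self.conds = tokens[0][::2]  # drop the 'AND' keywords

    def generate_query(self):
        return {"bool": {"must": [c.generate_query() for c in self.conds]}}


class OrNode:
    def __init__(self, tokens):
        self.conds = tokens[0][::2]  # drop the 'OR' keywords

    def generate_query(self):
        return {"bool": {"should": [c.generate_query() for c in self.conds]}}


expr = pp.infixNotation(operand, [
    (":", 2, pp.opAssoc.LEFT, ExactMatchNode),
    (pp.oneOf("> >= < <="), 2, pp.opAssoc.LEFT, ComparisonNode),
    (NOT, 1, pp.opAssoc.RIGHT, NotNode),
    (AND, 2, pp.opAssoc.LEFT, AndNode),
    (OR, 2, pp.opAssoc.LEFT, OrNode),
])

text = "time >= 2020-01-09 AND response:success OR NOT os:windows"
tree = expr.parseString(text, parseAll=True)[0]
query = tree.generate_query()
print(json.dumps({"query": query}, indent=2))
```

Each node only knows how to render itself and delegates to its children, so the nesting of the `bool` clauses follows the shape of the parsed tree.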