如何设置可以处理歧义的语法

Question

我正在尝试创建一个语法来解析我设计的一些类似于 Excel 的公式，其中字符串开头的特殊字符表示不同的来源。例如，$可以表示一个字符串，所以“$This is text”在程序中会被当作一个字符串输入，而&可以表示一个函数，所以&foo()可以是被视为对内部函数 foo.

的调用

我面临的问题是如何正确构造语法。例如，这是作为 MWE 的简化版本：

grammar = r'''start: instruction

?instruction: simple
            | func

STARTSYMBOL: "!"|"#"|"$"|"&"|"~"
SINGLESTR: (LETTER+|DIGIT+|"_"|" ")*
simple: STARTSYMBOL [SINGLESTR] (WORDSEP SINGLESTR)*
ARGSEP: ",," // argument separator
WORDSEP: "," // word separator
CONDSEP: ";;" // condition separator
STAR: "*"
func: STARTSYMBOL SINGLESTR "(" [simple|func] (ARGSEP simple|func)* ")"

%import common.LETTER
%import common.WORD
%import common.DIGIT
%ignore ARGSEP
%ignore WORDSEP
'''
parser = lark.Lark(grammar, parser='earley')

因此，使用此语法，诸如 $This is a string、&foo()、&foo(#arg1)、&foo($arg1,,#arg2) 和 &foo(!w1,w2,w3,,!w4,w5,w6) 之类的内容都会按预期进行解析。但是，如果我想为我的 simple 终端增加更多灵活性，那么我需要开始摆弄 SINGLESTR 令牌定义，这并不方便。

我尝试了什么

我无法通过的部分是，如果我想要一个包含括号的字符串（这是 func 的文字），那么在我目前的情况下我无法处理它们。

如果我在 SINGLESTR 中添加括号，那么我会得到 Expected STARTSYMBOL，因为它与 func 定义混淆了，它认为应该传递一个函数参数，这是有道理的。
如果我重新定义语法以仅为函数保留和符号并在 SINGLESTR 中添加括号，那么我可以解析带括号的字符串，但我尝试解析的每个函数都会给出 Expected LPAR.

我的意图是，任何以 $ 开头的内容都将被解析为 SINGLESTR 标记，然后我可以解析诸如 &foo($first arg (has) parentheses,,$second arg).

之类的内容

目前，我的解决方案是在我的字符串中使用 'escape' 词，如 LEFTPAR 和 RIGHTPAR，并且我编写了辅助函数以在处理树时将它们更改为括号。因此，$This is a LEFTPARtestRIGHTPAR 生成了正确的树，当我处理它时，它被转换为 This is a (test)。

提出一个一般性问题：我能否以这样一种方式定义我的语法，即某些特定于语法的字符在某些情况下被视为普通字符，而在其他情况下被视为特殊字符？

编辑 1

根据 jbndlr 的评论，我修改了语法以根据开始符号创建单独的模式：

grammar = r'''start: instruction

?instruction: simple
            | func

SINGLESTR: (LETTER+|DIGIT+|"_"|" ") (LETTER+|DIGIT+|"_"|" "|"("|")")*
FUNCNAME: (LETTER+) (LETTER+|DIGIT+|"_")* // no parentheses allowed in the func name
DB: "!" SINGLESTR (WORDSEP SINGLESTR)*
TEXT: "$" SINGLESTR
MD: "#" SINGLESTR
simple: TEXT|DB|MD
ARGSEP: ",," // argument separator
WORDSEP: "," // word separator
CONDSEP: ";;" // condition separator
STAR: "*"
func: "&" FUNCNAME "(" [simple|func] (ARGSEP simple|func)* ")"

%import common.LETTER
%import common.WORD
%import common.DIGIT
%ignore ARGSEP
%ignore WORDSEP
'''

这（有点）属于我的第二个测试用例。我可以解析所有 simple 类型的字符串（可以包含括号的 TEXT、MD 或 DB 标记）和空函数；例如，&foo() 或 &foo(&bar()) 解析正确。当我在函数中放入一个参数（无论哪种类型）时，我得到一个 UnexpectedEOF Error: Expected ampersand, RPAR or ARGSEP。作为概念证明，如果我在上面的新语法中从 SINGLESTR 的定义中删除括号，那么一切正常，但我回到原点。

Answer 1

问题是函数的参数包含在括号中，其中一个参数可能包含括号。
一种可能的解决方案是在 ( 或 ) 是 String

的一部分之前使用退格键

  SINGLESTR: (LETTER+|DIGIT+|"_"|" ") (LETTER+|DIGIT+|"_"|" "|"\("|"\)")*

C 使用的类似解决方案，包括双引号 (") 作为字符串常量的一部分，其中字符串常量用双引号引起来。

  example_string1='&f(!g\()'
  example_string2='&f(#g)'
  print(parser.parse(example_string1).pretty())
  print(parser.parse(example_string2).pretty())

输出是

   start
     func
       f
       simple   !g\(

   start
     func
      f
      simple    #g

Answer 2

import lark
grammar = r'''start: instruction

?instruction: simple
            | func

MIDTEXTRPAR: /\)+(?!(\)|,,|$))/
SINGLESTR: (LETTER+|DIGIT+|"_"|" ") (LETTER+|DIGIT+|"_"|" "|"("|MIDTEXTRPAR)*
FUNCNAME: (LETTER+) (LETTER+|DIGIT+|"_")* // no parentheses allowed in the func name
DB: "!" SINGLESTR (WORDSEP SINGLESTR)*
TEXT: "$" SINGLESTR
MD: "#" SINGLESTR
simple: TEXT|DB|MD
ARGSEP: ",," // argument separator
WORDSEP: "," // word separator
CONDSEP: ";;" // condition separator
STAR: "*"
func: "&" FUNCNAME "(" [simple|func] (ARGSEP simple|func)* ")"

%import common.LETTER
%import common.WORD
%import common.DIGIT
%ignore ARGSEP
%ignore WORDSEP
'''

parser = lark.Lark(grammar, parser='earley')
parser.parse("&foo($first arg (has) parentheses,,$second arg)")

输出：

Tree(start, [Tree(func, [Token(FUNCNAME, 'foo'), Tree(simple, [Token(TEXT, '$first arg (has) parentheses')]), Token(ARGSEP, ',,'), Tree(simple, [Token(TEXT, '$second arg')])])])

希望这就是您要找的。

这几天很疯狂。我试过百灵鸟但失败了。我还尝试了 persimonious 和 pyparsing。所有这些不同的解析器都存在相同的问题，即 'argument' 标记使用了作为函数一部分的右括号，最终失败，因为函数的括号没有闭合。

诀窍是弄清楚如何定义 "not special" 的右括号。请参阅上面代码中 MIDTEXTRPAR 的正则表达式。我将其定义为右括号，后面没有参数分隔符或字符串结尾。我通过使用正则表达式扩展名 (?!...) 来做到这一点，该扩展名仅在其后没有 ... 但不消耗字符时才匹配。幸运的是，它甚至允许在这个特殊的正则表达式扩展中匹配字符串的结尾。

编辑：

上面提到的方法只有在你没有以 ) 结尾的参数时才有效，因为 MIDTEXTRPAR 正则表达式不会捕捉到那个 ) 并且会认为这是函数的结尾，即使还有更多要处理的参数。此外，可能存在歧义，例如 ...asdf),...，它可能是参数内部函数声明的结尾，或者参数内部的 'text-like' ) 并且函数声明继续。

这个问题与你在问题中描述的不是上下文无关文法有关(https://en.wikipedia.org/wiki/Context-free_grammar) for which parsers such as lark exist. Instead it is a context-sensitive grammar (https://en.wikipedia.org/wiki/Context-sensitive_grammar)。

它是上下文相关语法的原因是因为你需要解析器 'remember' 它嵌套在一个函数中，嵌套有多少层，并且在语法的某种语法。

编辑 2：

另请查看以下上下文相关的解析器，它似乎可以解决问题，但嵌套函数的数量具有指数级的时间复杂度，因为它会尝试解析所有可能的函数障碍，直到它找到一个有效的。我相信它必须具有指数复杂度，因为它不是上下文无关的。


_funcPrefix = '&'
_debug = False

class ParseException(Exception):
    pass

def GetRecursive(c):
    if isinstance(c,ParserBase):
        return c.GetRecursive()
    else:
        return c

class ParserBase:
    def __str__(self):
        return type(self).__name__ + ": [" + ','.join(str(x) for x in self.contents) +"]"
    def GetRecursive(self):
        return (type(self).__name__,[GetRecursive(c) for c in self.contents])

class Simple(ParserBase):
    def __init__(self,s):
        self.contents = [s]

class MD(Simple):
    pass

class DB(ParserBase):
    def __init__(self,s):
        self.contents = s.split(',')

class Func(ParserBase):
    def __init__(self,s):
        if s[-1] != ')':
            raise ParseException("Can't find right parenthesis: '%s'" % s)
        lparInd = s.find('(')
        if lparInd < 0:
            raise ParseException("Can't find left parenthesis: '%s'" % s)
        self.contents = [s[:lparInd]]
        argsStr = s[(lparInd+1):-1]
        args = list(argsStr.split(',,'))
        i = 0
        while i<len(args):
            a = args[i]
            if a[0] != _funcPrefix:
                self.contents.append(Parse(a))
                i += 1
            else:
                j = i+1
                while j<=len(args):
                    nestedFunc = ',,'.join(args[i:j])
                    if _debug:
                        print(nestedFunc)
                    try:
                        self.contents.append(Parse(nestedFunc))
                        break
                    except ParseException as PE:
                        if _debug:
                            print(PE)
                        j += 1
                if j>len(args):
                    raise ParseException("Can't parse nested function: '%s'" % (',,'.join(args[i:])))
                i = j

def Parse(arg):
    if arg[0] not in _starterSymbols:
        raise ParseException("Bad prefix: " + arg[0])
    return _starterSymbols[arg[0]](arg[1:])

_starterSymbols = {_funcPrefix:Func,'$':Simple,'!':DB,'#':MD}

P = Parse("&foo($first arg (has)) parentheses,,&f($asdf,,&nested2(423))),,&second(!arg,wer))")
print(P)

import pprint
pprint.pprint(P.GetRecursive())

如何设置可以处理歧义的语法

How to setup a grammar that can handle ambiguity

python

grammar

lark-parser

我尝试了什么

编辑 1