Match substrings in Lark

How do I correctly match substrings using Lark?

My intention (which may not be possible or advisable with Lark, or with any CFG) is to match and parse only the important part of a string and ignore the rest.

Here is my code:

from lark import Lark

# Raw string, so the \w in the regex terminal is not read as a string escape.
grammar = r"""
    input: not_important* important not_important*

    important: one_person
        | two_people

    one_person: PERSON
    two_people: one_person conj one_person
    not_important: RANDOM_WORD

    conj: CONJ

    PERSON: "John" | "Mary"
    CONJ: "and" | "or"
    RANDOM_WORD: /\w+/

    %import common.WS
    %ignore WS
"""

if __name__ == '__main__':
    parser = Lark(grammar, start='input', ambiguity='explicit')
    tree = parser.parse('Yesterday John and Mary kissed')
    print(tree.pretty())

This works for some inputs, but not when unimportant words surround the important part, e.g. "Yesterday John and Mary kissed". For that input, I expect:

input
    not_important   Yesterday
    important
      two_people
        one_person  John
        conj    and
        one_person  Mary
    not_important   kissed

But I get:

_ambig
  input
    not_important   Yesterday
    important
      one_person    John
    not_important   and
    not_important   Mary
    not_important   kissed
  input
    not_important   Yesterday
    important
      two_people
        one_person  John
        conj    and
        one_person  Mary
    not_important   kissed
    not_important   John
    not_important   and

That is, not only does Lark consider the input ambiguous, but the second parse is also wrong, because two terminals ("John" and "and") are consumed twice.

One way to solve this is to give some of your rules a higher priority:

from lark import Lark

# Raw string, so the \w in the regex terminal is not read as a string escape.
grammar = r"""
    input: not_important* important not_important* 

    important.2: one_person
        | two_people

    one_person: PERSON
    two_people.2: one_person conj one_person
    not_important: RANDOM_WORD

    conj.2: CONJ

    PERSON: "John" | "Mary"
    CONJ: "and" | "or"
    RANDOM_WORD: /\w+/

    %import common.WS
    %ignore WS
"""

if __name__ == '__main__':
    parser = Lark(grammar, start='input')
    tree = parser.parse('Yesterday John and Mary kissed')
    print(tree.pretty())

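Note the .2 added to important, two_people, and conj; this is how you set a priority in Lark. By default, all rules have priority 1 (the exact default may depend on the Lark version). Once the priorities are set, you can drop ambiguity='explicit', because Lark can then resolve the ambiguity on its own.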

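The second way is to use lalr as the parser, which lets you set priorities on terminals. We can then give PERSON and CONJ a priority of 2. With that, the grammar is no longer ambiguous, which is a requirement for using lalr.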
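To make the effect of terminal priorities concrete, here is a rough pure-Python sketch of the idea. It is not Lark's lexer: the TERMINALS table and the classify helper are hypothetical names made up for this illustration, and the priority 0 given to RANDOM_WORD simply stands for "lower than 2". It only shows that when a word matches more than one terminal, the higher-priority terminal wins.

```python
import re

# Hypothetical stand-ins for the grammar's terminals: (name, pattern, priority).
# RANDOM_WORD gets a lower priority than PERSON and CONJ.
TERMINALS = [
    ("RANDOM_WORD", r"\w+", 0),
    ("PERSON", r"John|Mary", 2),
    ("CONJ", r"and|or", 2),
]


def classify(word):
    """Return the name of the highest-priority terminal matching the whole word."""
    ordered = sorted(TERMINALS, key=lambda t: -t[2])  # highest priority first
    for name, pattern, _priority in ordered:
        if re.fullmatch(pattern, word):
            return name
    raise ValueError(f"no terminal matches {word!r}")


# Whitespace is skipped by splitting, mirroring "%ignore WS" in the grammar.
tokens = [(classify(w), w) for w in "Yesterday John and Mary kissed".split()]
print(tokens)
```

Because "John", "and", and "Mary" are never classified as RANDOM_WORD here, each word gets exactly one token type, which is what leaves only one way to parse the sentence.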

from lark import Lark

# Raw string, so the \w in the regex terminal is not read as a string escape.
grammar = r"""
    input: not_important* important not_important* 

    important: one_person
        | two_people

    one_person: PERSON
    two_people: one_person conj one_person
    not_important: RANDOM_WORD

    conj: CONJ

    PERSON.2: "John" | "Mary"
    CONJ.2: "and" | "or"
    RANDOM_WORD: /\w+/

    %import common.WS
    %ignore WS
"""

if __name__ == '__main__':
    parser = Lark(grammar, start='input', parser="lalr")
    tree = parser.parse('Yesterday John and Mary kissed')
    print(tree.pretty())

Both approaches output:

input
  not_important Yesterday
  important
    two_people
      one_person        John
      conj      and
      one_person        Mary
  not_important kissed