How to parse a non-unique positional pattern?

Question

I have two questions related to parsing a somewhat nasty pattern. Here are some nonsense examples:

examples = [
    "",
    "red green",
    "#1# red green",
    "#1# red green <2>",
    "#1,2# red green <2,3>",
    "red green ()",
    "#1# red green (blue)",
    "#1# red green (#5# blue) <2>",
    "#1# red green (#5# blue <6>) <2>",
    "#1,2# red green (#5# blue (purple) <6>;#7# yellow <10>) <2,3>",
    "#1,2# red (maroon) green (#5# blue (purple) <6>;#7# yellow <10>) <2,3>",
]

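I should say at this point that I have no control over how these strings are created.

As you can see, basically every part of the pattern that I want to parse is optional. Then there are the different parts that I want to capture. The way I see it, the structure of these examples is: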

[cars] [colors] [comments] [buyers]

where comments consists of substructures, of which there can be several, separated by semicolons:

comments: ([cars] [colors] [buyers]; ...)

I created the following grammar to capture the content:

import pyparsing as pp

integer = pp.pyparsing_common.integer

car_ref = "#" + pp.Group(pp.delimitedList(integer))("cars") + "#"

buyer_ref = "<" + pp.Group(pp.delimitedList(integer))("buyers") + ">"

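My questions, then, are:

Is there a clever way (maybe via location) to distinguish parentheses that belong to colors from those that belong to comments?
I have worked on the problem of nested parentheses in the comments. My strategy was to take the inner string and break it apart using ; as the separator. However, I failed to carry out that strategy. What I tried is: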

sub_comment = (
    pp.Optional(car_ref) +
    pp.Group(pp.ZeroOrMore(pp.Regex(r"[^;#<>\s]")))("colors") +
    pp.Optional(buyer_ref)
)

split_comments = pp.Optional(pp.delimitedList(
    pp.Group(sub_comment)("comments*"),
    delim=";"
))


def parse_comments(original, location, tokens):
    # Strip the parentheses.
    return split_comments.transformString(original[tokens[0] + 1:tokens[2] - 1])


comments = pp.originalTextFor(pp.nestedExpr()).setParseAction(parse_comments)

When I use this, everything ends up as one contiguous string, presumably because of the outer pp.originalTextFor:

res = comments.parseString("(#5# blue (purple) <6>;#7# yellow <10>)", parseAll=True)

Edit:

Taking the last example string, I would like to end up with an object structure like this:

{
  "cars": [1, 2],
  "colors": "red (maroon) green",
  "buyers": [2, 3],
  "comments": [
    {
      "cars": [5],
      "colors": "blue (purple)",
      "buyers": [6]
    },
    {
      "cars": [7],
      "colors": "yellow",
      "buyers": [10]
    }
  ]
}

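So parentheses in the colors section should be kept in order, as in prose. As for the parentheses that introduce the comments section, I don't care about their order, nor about the order of the individual comments.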

Answer 1

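I think you have most of the pieces; you are just struggling with the recursive part, where comments can themselves contain substructures, including more comments.

You gave this as your BNF: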

structure ::= [cars] [colors] [comments] [buyers]
cars ::= '#' integer, ... '#'
buyers ::= '<' integer, ... '>'

I filled in the blanks with these guesses, based on the examples you gave:

color ::= word composed of alphas
colors ::= (color | '(' color ')' )...

comments ::= '(' structure ';' ... ')'

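I took your definitions of cars and buyers, and added colors and a recursive definition for comments. Then I did a fairly rote conversion from the BNF to pyparsing expressions: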

import pyparsing as pp

integer = pp.pyparsing_common.integer

car_ref = "#" + pp.Group(pp.delimitedList(integer))("cars") + "#"
buyer_ref = "<" + pp.Group(pp.delimitedList(integer))("buyers") + ">"

# not sure if this will be sufficient for color, but it works for the given examples
color = pp.Word(pp.alphas)
colors = pp.originalTextFor(pp.OneOrMore(color | '(' + color + ')'))("colors")

# define comment placeholder so it can be used in definition of structure
comment = pp.Forward()

structure = pp.Group(pp.Optional(car_ref)
                     + pp.Optional(colors)
                     + pp.Optional(comment)("comments")
                     + pp.Optional(buyer_ref))

# now insert the definition of a comment as a delimited list of structures; this takes care of
# any nesting of comments within comments
LPAREN, RPAREN = map(pp.Suppress, "()")
comment <<= pp.Group(LPAREN + pp.Optional(pp.delimitedList(structure, delim=';')) + RPAREN)

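The tricky part is defining the contents of a comment as a delimited list of structure expressions, and using the <<= operator to insert that definition into the previously defined Forward() placeholder.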
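To show the Forward() / <<= idiom in isolation, here is a minimal sketch that is separate from the car grammar. The bracketed nested-list syntax and the names nested and item are invented purely for illustration; only the pyparsing calls themselves are taken from the code above.

```python
import pyparsing as pp

LBRACK, RBRACK = map(pp.Suppress, "[]")
integer = pp.pyparsing_common.integer

# Placeholder: lets `item` refer to `nested` before `nested` has a definition.
nested = pp.Forward()
item = integer | nested

# Fill in the placeholder; a list may be empty or hold integers and more lists.
nested <<= pp.Group(LBRACK + pp.Optional(pp.delimitedList(item)) + RBRACK)

result = nested.parseString("[1, [2, 3], [[4]], []]", parseAll=True)
print(result.asList())
```

The same shape appears in the grammar above: structure mentions comment before comment is defined, and comment is later filled in with a list of structure expressions.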

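Passing your examples to structure.runTests() gives the output below (the default behavior is to treat Python-style '#' comments as comments, so this feature has to be disabled when calling runTests with your particular examples, since a leading '#' is a valid car introducer):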

structure.runTests(examples, comment=None)

red green
[['red green']]
[0]:
  ['red green']
  - colors: 'red green'

#1# red green
[['#', [1], '#', 'red green']]
[0]:
  ['#', [1], '#', 'red green']
  - cars: [1]
  - colors: 'red green'

#1# red green <2>
[['#', [1], '#', 'red green', '<', [2], '>']]
[0]:
  ['#', [1], '#', 'red green', '<', [2], '>']
  - buyers: [2]
  - cars: [1]
  - colors: 'red green'

#1,2# red green <2,3>
[['#', [1, 2], '#', 'red green', '<', [2, 3], '>']]
[0]:
  ['#', [1, 2], '#', 'red green', '<', [2, 3], '>']
  - buyers: [2, 3]
  - cars: [1, 2]
  - colors: 'red green'

red green ()
[['red green', [[]]]]
[0]:
  ['red green', [[]]]
  - colors: 'red green'
  - comments: [[]]
    [0]:
      []

#1# red green (blue)
[['#', [1], '#', 'red green (blue)']]
[0]:
  ['#', [1], '#', 'red green (blue)']
  - cars: [1]
  - colors: 'red green (blue)'

#1# red green (#5# blue) <2>
[['#', [1], '#', 'red green', [['#', [5], '#', 'blue']], '<', [2], '>']]
[0]:
  ['#', [1], '#', 'red green', [['#', [5], '#', 'blue']], '<', [2], '>']
  - buyers: [2]
  - cars: [1]
  - colors: 'red green'
  - comments: [['#', [5], '#', 'blue']]
    [0]:
      ['#', [5], '#', 'blue']
      - cars: [5]
      - colors: 'blue'

#1# red green (#5# blue <6>) <2>
[['#', [1], '#', 'red green', [['#', [5], '#', 'blue', '<', [6], '>']], '<', [2], '>']]
[0]:
  ['#', [1], '#', 'red green', [['#', [5], '#', 'blue', '<', [6], '>']], '<', [2], '>']
  - buyers: [2]
  - cars: [1]
  - colors: 'red green'
  - comments: [['#', [5], '#', 'blue', '<', [6], '>']]
    [0]:
      ['#', [5], '#', 'blue', '<', [6], '>']
      - buyers: [6]
      - cars: [5]
      - colors: 'blue'

#1,2# red green (#5# blue (purple) <6>;#7# yellow <10>) <2,3>
[['#', [1, 2], '#', 'red green', [['#', [5], '#', 'blue (purple)', '<', [6], '>'], ['#', [7], '#', 'yellow', '<', [10], '>']], '<', [2, 3], '>']]
[0]:
  ['#', [1, 2], '#', 'red green', [['#', [5], '#', 'blue (purple)', '<', [6], '>'], ['#', [7], '#', 'yellow', '<', [10], '>']], '<', [2, 3], '>']
  - buyers: [2, 3]
  - cars: [1, 2]
  - colors: 'red green'
  - comments: [['#', [5], '#', 'blue (purple)', '<', [6], '>'], ['#', [7], '#', 'yellow', '<', [10], '>']]
    [0]:
      ['#', [5], '#', 'blue (purple)', '<', [6], '>']
      - buyers: [6]
      - cars: [5]
      - colors: 'blue (purple)'
    [1]:
      ['#', [7], '#', 'yellow', '<', [10], '>']
      - buyers: [10]
      - cars: [7]
      - colors: 'yellow'

#1,2# red (maroon) green (#5# blue (purple) <6>;#7# yellow <10>) <2,3>
[['#', [1, 2], '#', 'red (maroon) green', [['#', [5], '#', 'blue (purple)', '<', [6], '>'], ['#', [7], '#', 'yellow', '<', [10], '>']], '<', [2, 3], '>']]
[0]:
  ['#', [1, 2], '#', 'red (maroon) green', [['#', [5], '#', 'blue (purple)', '<', [6], '>'], ['#', [7], '#', 'yellow', '<', [10], '>']], '<', [2, 3], '>']
  - buyers: [2, 3]
  - cars: [1, 2]
  - colors: 'red (maroon) green'
  - comments: [['#', [5], '#', 'blue (purple)', '<', [6], '>'], ['#', [7], '#', 'yellow', '<', [10], '>']]
    [0]:
      ['#', [5], '#', 'blue (purple)', '<', [6], '>']
      - buyers: [6]
      - cars: [5]
      - colors: 'blue (purple)'
    [1]:
      ['#', [7], '#', 'yellow', '<', [10], '>']
      - buyers: [10]
      - cars: [7]
      - colors: 'yellow'
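These results also suggest an answer to the first question, without needing position information: with the guessed colors rule, a parenthesized group is only absorbed into colors when it contains exactly one alphabetic word, and anything else is left for comments. A small sketch of just that rule, assuming the same guessed definition of color (the sample strings are made up for illustration):

```python
import pyparsing as pp

color = pp.Word(pp.alphas)
colors = pp.originalTextFor(pp.OneOrMore(color | "(" + color + ")"))

samples = [
    "red green (blue)",               # single word in parens: still a color
    "red green (#5# blue)",           # parens holding a car ref: not a color
    "red (maroon) green (blue <6>)",  # parens holding a buyer ref: not a color
]

# No parseAll here, so each parse stops where the colors rule stops matching.
matched = [colors.parseString(sample)[0] for sample in samples]
for sample, text in zip(samples, matched):
    print(f"{sample!r} -> colors: {text!r}")
```

Note the consequence for an input like "(blue)": it is ambiguous in the original format, and this grammar resolves it in favor of colors, as the "#1# red green (blue)" result shows.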

If you use asDict() to convert the parse results to regular Python dicts, you get:

structure.runTests(examples, comment=None,
                   postParse=lambda test, results: results[0].asDict()
                   )

red green
{'colors': 'red green'}

#1# red green
{'cars': [1], 'colors': 'red green'}

#1# red green <2>
{'colors': 'red green', 'cars': [1], 'buyers': [2]}

#1,2# red green <2,3>
{'colors': 'red green', 'cars': [1, 2], 'buyers': [2, 3]}

red green ()
{'comments': [[]], 'colors': 'red green'}

#1# red green (blue)
{'cars': [1], 'colors': 'red green (blue)'}

#1# red green (#5# blue) <2>
{'colors': 'red green', 'cars': [1], 'comments': [{'cars': [5], 'colors': 'blue'}], 'buyers': [2]}

#1# red green (#5# blue <6>) <2>
{'colors': 'red green', 'cars': [1], 'comments': [{'colors': 'blue', 'cars': [5], 'buyers': [6]}], 'buyers': [2]}

#1,2# red green (#5# blue (purple) <6>;#7# yellow <10>) <2,3>
{'colors': 'red green', 'cars': [1, 2], 'comments': [{'colors': 'blue (purple)', 'cars': [5], 'buyers': [6]}, {'colors': 'yellow', 'cars': [7], 'buyers': [10]}], 'buyers': [2, 3]}

#1,2# red (maroon) green (#5# blue (purple) <6>;#7# yellow <10>) <2,3>
{'colors': 'red (maroon) green', 'cars': [1, 2], 'comments': [{'colors': 'blue (purple)', 'cars': [5], 'buyers': [6]}, {'colors': 'yellow', 'cars': [7], 'buyers': [10]}], 'buyers': [2, 3]}
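Outside of runTests, the same conversion can be applied to a single string. The sketch below repeats the grammar so that it runs on its own; parse_record is a hypothetical helper name introduced here, not part of pyparsing.

```python
import json

import pyparsing as pp

integer = pp.pyparsing_common.integer
car_ref = "#" + pp.Group(pp.delimitedList(integer))("cars") + "#"
buyer_ref = "<" + pp.Group(pp.delimitedList(integer))("buyers") + ">"

color = pp.Word(pp.alphas)
colors = pp.originalTextFor(pp.OneOrMore(color | "(" + color + ")"))("colors")

comment = pp.Forward()
structure = pp.Group(
    pp.Optional(car_ref)
    + pp.Optional(colors)
    + pp.Optional(comment)("comments")
    + pp.Optional(buyer_ref)
)
LPAREN, RPAREN = map(pp.Suppress, "()")
comment <<= pp.Group(
    LPAREN + pp.Optional(pp.delimitedList(structure, delim=";")) + RPAREN
)


def parse_record(text):
    """Hypothetical helper: parse one string and return a plain dict."""
    # structure is wrapped in a Group, so the named results live in element 0.
    return structure.parseString(text, parseAll=True)[0].asDict()


record = parse_record(
    "#1,2# red (maroon) green (#5# blue (purple) <6>;#7# yellow <10>) <2,3>"
)
print(json.dumps(record, indent=2, sort_keys=True))
```

With parseAll=True, a string that does not fit the grammar raises pp.ParseException instead of being partially matched, which is probably what you want for input you do not control.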


python

pyparsing

python-3.x