Using pyparsing, how can I group expressions that are matched by OneOrMore(expr1|expr2)?

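My website lets users post a string containing multiple questions, each followed by multiple-choice answers. A mandatory style guide makes it possible to parse the submission with regex; each question and its MCQ choices are then stored in a database and later served back in randomized practice exams.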

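I would like to transition to pyparsing because the regex is not immediately readable and I feel somewhat locked in by it. I want the option of easily extending my question parser, and doing that with regex feels very cumbersome.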

User input has the form:

quiz = [<question-answer>, <q-start>]
<question-answer> = <question> + <answer>
<question> = [<q-text>, \n] ?!= <a-start>
<answer> = [<answer>, <a-start>]  ?!= <q-start>
<q-start> = <nums> + "." | ")"
<a-start> = <alphas> + "." | ")" 

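The long user-input string is split into question-answer blocks, each delimited by the q-start of the next block. A question is all the text between a q-start and the first a-start. The answers are a list of all the text between an a-start and either the next a-start or the following q-start.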

Sample text:

3. A lesion that affects N. Solitarius will result in the patient having problems related to:
a. taste and blood pressure regulation
c. swallowing and respiration
b. smell and taste
d. voice quality and taste
e. whistling and chewing

4. A patient comes to your office complaining of weakness on the right side of their body. You notice that their head is
turned slightly to the left and their right shoulder droops. When asked to protrude their tongue, it deviates to the right. Eye
movements and eye-related reflexes appear to be normal. The lesion most likely is located in the:
c. left ventral medulla
a. left ventral midbrain
b. right dorsal medulla
d. left ventral pons
e. right ventral pons

5. A colleague {...}

The regexes I have been using:

# matches a question-answer block. Matching q-start until an empty line.
regex1 = r"(^[\t ]*[0-9]+[\)\.][\t ]+[\s\S]*?(?=^[\n\r]))" 

# Within question-answer block, matches everything that does not start with a-start
regex6 = r"(^(?!(^[a-fA-F][\)\.]\s+[\s\S]+)).*)"

# Matches all text between a-start and the following a-start, or until the question-answer substring block ends.
regex5 = r"(^[a-fA-F][\)\.]\s+[\s\S]+)"       

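Then a bit of Python and re to trim off the question numbers and MCQ letters, join any questions that are broken across lines, and append the MCQ choices to a list.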

In pyparsing I have tried this:

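from pyparsing import (Char, Group, LineEnd, LineStart, OneOrMore, Optional,
                       SkipTo, StringEnd, Suppress, Word, alphas, nums, oneOf)

EOL = Suppress(LineEnd())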
delim = oneOf(". )")
q_start = LineStart() + Word(nums) + delim
a_start = LineStart() + Char(alphas) + delim

question = Optional(EOL) + Group(Suppress(q_start) + OneOrMore(SkipTo(LineEnd()) + EOL, stopOn=a_start)).setResultsName('question', listAllMatches=True)

answer = Optional(EOL) + Group(Suppress(a_start) + OneOrMore( SkipTo(LineEnd()) + EOL, stopOn=(a_start | q_start | StringEnd()))).setResultsName('answer', listAllMatches=True)



qi = Group(OneOrMore(question|answer)).setResultsName('group', listAllMatches=True)
t = qi.parseString(test)
print(t.dump())

Result:

[[['The tectum of the midbrain comprises the:'], ['superior and inferior colliculi'], ['reticular formation'], ['internal arcuate fibers'], ['cerebellar peduncles'], ['pyramids'], ['Damage to the dorsal columns on one side of the spinal cord would results in:'], ['loss of MVP ipsilaterally below the level of the lesion'], ['hypertonicity of the contralateral limbs'], ['loss of pain and temperature contralaterally below the level of the lesion'], ['loss of MVP contralaterally above the level of the lesion'], ['loss of pain and temperature ipsilaterally above the level of the lesion']]]
- group: [[['The tectum of the midbrain comprises the:'], ['superior and inferior colliculi'], ['reticular formation'], ['internal arcuate fibers'], ['cerebellar peduncles'], ['pyramids'], ['Damage to the dorsal columns on one side of the spinal cord would results in:'], ['loss of MVP ipsilaterally below the level of the lesion'], ['hypertonicity of the contralateral limbs'], ['loss of pain and temperature contralaterally below the level of the lesion'], ['loss of MVP contralaterally above the level of the lesion'], ['loss of pain and temperature ipsilaterally above the level of the lesion']]]
  [0]:
    [['The tectum of the midbrain comprises the:'], ['superior and inferior colliculi'], ['reticular formation'], ['internal arcuate fibers'], ['cerebellar peduncles'], ['pyramids'], ['Damage to the dorsal columns on one side of the spinal cord would results in:'], ['loss of MVP ipsilaterally below the level of the lesion'], ['hypertonicity of the contralateral limbs'], ['loss of pain and temperature contralaterally below the level of the lesion'], ['loss of MVP contralaterally above the level of the lesion'], ['loss of pain and temperature ipsilaterally above the level of the lesion']]
    - answer: [['superior and inferior colliculi'], ['reticular formation'], ['internal arcuate fibers'], ['cerebellar peduncles'], ['pyramids'], ['loss of MVP ipsilaterally below the level of the lesion'], ['hypertonicity of the contralateral limbs'], ['loss of pain and temperature contralaterally below the level of the lesion'], ['loss of MVP contralaterally above the level of the lesion'], ['loss of pain and temperature ipsilaterally above the level of the lesion']]
      [0]:
        ['superior and inferior colliculi']
      [1]:
        ['reticular formation']
      [2]:
        ['internal arcuate fibers']
      [3]:
        ['cerebellar peduncles']
      [4]:
        ['pyramids']
      [5]:
        ['loss of MVP ipsilaterally below the level of the lesion']
      [6]:
        ['hypertonicity of the contralateral limbs']
      [7]:
        ['loss of pain and temperature contralaterally below the level of the lesion']
      [8]:
        ['loss of MVP contralaterally above the level of the lesion']
      [9]:
        ['loss of pain and temperature ipsilaterally above the level of the lesion']
    - question: [['The tectum of the midbrain comprises the:'], ['Damage to the dorsal columns on one side of the spinal cord would results in:']]
      [0]:
        ['The tectum of the midbrain comprises the:']
      [1]:
        ['Damage to the dorsal columns on one side of the spinal cord would results in:']

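This matches the questions and answers, and correctly skips over the line breaks that can interrupt a question or an answer. The problem is that they are not grouped the way I expected. I was expecting something more like group[0] = question, answer[1:4]; group[2] = question, answer[1:4].

Does anyone have any suggestions?

Thanks!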

I think you are on the right track. I took a separate pass at your parser and arrived at a very similar structure, with only a few differences.

from pyparsing import Combine

# EOL, q_start, and a_start are the definitions from the question
question = Combine(q_start.suppress() + SkipTo(EOL + a_start))
answer = Combine(a_start.suppress() + SkipTo(EOL + (a_start | q_start | StringEnd())))
q_a = Group(question("question") + answer[1, ...]("answers"))

for t in q_a[...].parseString(test):
    print(t.dump())

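The biggest difference is that, instead of just doing OneOrMore(question | answer), the expression I used to parse your text defines a Group(question + OneOrMore(answer)). This creates one group for each question together with its associated answers. In your parser, using listAllMatches just creates one results name holding all the questions and another holding all the answers, but loses all association between them. Creating a group of "question + one or more answers" preserves those associations.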

If you want to strip out the '\n's, a parse action makes that easier than the EOL business.
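
As a minimal sketch of what such a parse action might look like, shown in isolation: the join_lines name and the tiny body expression are hypothetical and stand in for whichever SkipTo or Combine expression captures the multi-line text in your grammar.

```python
import pyparsing as pp

def join_lines(tokens):
    # Replace the matched text with a copy in which every run of
    # whitespace (including '\n') becomes a single space.
    return " ".join(tokens[0].split())

# Stand-in expression: grab everything up to the end of the input,
# then let the parse action clean it up.
body = pp.SkipTo(pp.StringEnd()).setParseAction(join_lines)

wrapped = "A patient comes to your office\ncomplaining of weakness\n"
cleaned = body.parseString(wrapped)[0]
print(cleaned)
```

The grammar then only has to find where the text starts and stops; the line breaks are dealt with after the match instead of being threaded through the expressions.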