How can I split text using pyparsing with a specific token?

Question

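Please note: the linked question is about how to parse a file with a single token at the end of each \n-terminated line, which is pretty easy. My question is different: I am having a hard time getting the free text to stop before the last word that precedes a :, so that this word is excluded from the free-text search entered before the filters.

On our API I have a user input like some free text port:45 title:welcome to our website, and what I need at the end of parsing is 2 parts -> [some free text, port:45 title:welcome]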

from pyparsing import *
token = "some free text port:45 title:welcome to our website"
t = Word(alphas, " "+alphanums) + Word(" "+alphas,":"+alphanums)
t.parseString(token)  # raises a ParseException

This gives me an error:

pyparsing.ParseException: Expected W:( ABC..., :ABC...), found ':'  (at char 21), (line:1, col:22)

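This happens because the first Word consumes the whole string up to some free text port, and what is left is :45 title:welcome to our website.

How can I use pyparsing to get all the data before port: in one group, and port:.... in another group?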

Answer 1

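I know the question is about pyparsing, but for this specific use I think a regular expression is more standard and simpler, while pyparsing is probably better suited to more complex parsing problems.

Here is a regex that might work: ^(.+port\:\d+) (title:.+)$

Here is the Python code: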

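import re
# raw string, so the backslashes reach the regex engine unchanged
pattern = r"^(.+port\:\d+) (title:.+)$"
token = "some free text port:45 title:welcome to our website"
m = re.match(pattern, token)
if m:
    # grp1: everything up to and including port:<digits>; grp2: title:<rest>
    grp1, grp2 = m.group(1), m.group(2)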
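If the goal is the split described in the question (free text first, then the filters), one possible variation is to cut the string at the first name: tag. This is only a sketch: the helper name split_free_text is made up for illustration, and it assumes that a tag name consists of word characters directly followed by a colon.

```python
import re

def split_free_text(query):
    """Split a query into (free text, filters) at the first `name:` tag.

    Hypothetical helper, assuming tag names match \\w+ followed by ':'.
    """
    # Find the first run of word characters that is directly followed by ':'
    m = re.search(r"\b\w+:", query)
    if m is None:
        # No tag at all: the whole input is free text
        return query.strip(), ""
    return query[:m.start()].strip(), query[m.start():].strip()

free_text, filters = split_free_text(
    "some free text port:45 title:welcome to our website"
)
print(free_text)  # some free text
print(filters)    # port:45 title:welcome to our website
```

Note that free text which itself contains a colon (a URL, for example) would be cut at that point, so this only fits input where colons appear in filters alone.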

Answer 2

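Adding " " as one of the valid characters in a Word almost always leads to this problem, so it is generally a pyparsing anti-pattern. Word does its character repetition matching inside its parseImpl() method, so there is no way to add any kind of lookahead.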
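A small sketch of that greedy behaviour, using a shortened version of the input from the question (the variable name greedy is just for illustration):

```python
import pyparsing as pp

# A Word whose character set includes the space cannot stop before a tag name:
# it keeps consuming letters and spaces until it hits a character outside its set.
greedy = pp.Word(pp.alphas + " ")

result = greedy.parseString("some free text port:45")
print(result[0])  # some free text port
```

The tag name port ends up inside the free text, because the match only stops at the ":".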

To get spaces in your expression, you will probably need a OneOrMore, wrapped in originalTextFor, like this:

import pyparsing as pp

word = pp.Word(pp.printables, excludeChars=":")

non_tag = word + ~pp.FollowedBy(":")

# tagged value is two words with a ":"
tag = pp.Group(word + ":" + word)

# one or more non-tag words - use originalTextFor to get back 
# a single string, including intervening white space
phrase = pp.originalTextFor(non_tag[1, ...])

parser = (phrase | tag)[...]

parser.runTests("""\
    some free text port:45 title:welcome to our website
    """)

Prints:

some free text port:45 title:welcome to our website
['some free text', ['port', ':', '45'], ['title', ':', 'welcome'], 'to our website']
[0]:
  some free text
[1]:
  ['port', ':', '45']
[2]:
  ['title', ':', 'welcome']
[3]:
  to our website
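To get from that token list to separate free-text and filter parts, one option is to sort the parsed items by type. The following is a sketch that reuses the grammar above; the function name split_query and the choice to rejoin each tag into a name:value string are assumptions, not part of pyparsing.

```python
import pyparsing as pp

# Same grammar as above
word = pp.Word(pp.printables, excludeChars=":")
non_tag = word + ~pp.FollowedBy(":")
tag = pp.Group(word + ":" + word)
phrase = pp.originalTextFor(non_tag[1, ...])
parser = (phrase | tag)[...]

def split_query(text):
    """Hypothetical helper: return (free-text phrases, 'name:value' filters)."""
    free_text, filters = [], []
    for item in parser.parseString(text):
        if isinstance(item, str):
            # originalTextFor yields each phrase as one plain string
            free_text.append(item)
        else:
            # Group yields a nested result such as ['port', ':', '45']
            filters.append("".join(item))
    return free_text, filters

free_text, filters = split_query(
    "some free text port:45 title:welcome to our website"
)
print(free_text)  # ['some free text', 'to our website']
print(filters)    # ['port:45', 'title:welcome']
```

Since a tag value here is a single word, the trailing to our website comes back as a second free-text phrase rather than as part of the title.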

pyparsing

python-3.x