Regex to parse queries with quoted substrings and return nested lists of individual words

Question

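I'm trying to write a regex that takes a string of words containing quoted substrings, such as "green lizards" like to sit "in the sun", tokenizes it into space-separated words and quoted substrings (using either single or double quotes), and then returns a list [['green', 'lizards'], 'like', 'to', 'sit', ['in', 'the', 'sun']], where each list item is either a single word or, where a quoted substring was encountered, a nested list of words.

I'm new to regex and was able to find a solution that captures the quoted parts: re.findall('"([^"]*)"', '"green lizards" like to sit "in the sun"') ... returns: ['green lizards', 'in the sun']

But this doesn't capture the individual words, and it doesn't tokenize the quoted parts either (it returns each one as a single string rather than a list of words, so I would have to split() each of them separately).

How would I write a regex that correctly returns the kind of list I want? Also, I'm open to suggestions for methods/tools that are better suited than regex to parsing these kinds of strings.

Thanks!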

Answer 1

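You can use re.split followed by a final str.split:

import re
s = '"green lizards" like to sit "in the sun"'
# Split on whitespace adjacent to a double quote; wrap each quoted piece in an extra list
new_s = [[i[1:-1].split()] if i.startswith('"') else i.split() for i in re.split(r'(?<=")\s|\s(?=")', s)]
# Flatten one level: quoted phrases stay nested, bare words become top-level items
last_result = [i for b in new_s for i in b]
print(last_result)

Output: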

[['green', 'lizards'], 'like', 'to', 'sit', ['in', 'the', 'sun']]
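
To make the two steps easier to follow, here is a minimal sketch that prints the intermediate values of the same approach. The variable names (pieces, grouped, result) are illustrative and not part of the answer. It assumes double quotes only and no whitespace directly inside the quotes.

```python
import re

s = '"green lizards" like to sit "in the sun"'

# Split on a whitespace character that directly follows or precedes a double quote.
pieces = re.split(r'(?<=")\s|\s(?=")', s)
print(pieces)

# A quoted piece becomes one nested list; an unquoted piece becomes a list of words.
grouped = [[p[1:-1].split()] if p.startswith('"') else p.split() for p in pieces]
print(grouped)

# Flatten one level so bare words end up as top-level items.
result = [item for group in grouped for item in group]
print(result)
```

The lookbehind and lookahead only inspect the neighbouring quote, so the quote characters stay attached to their pieces and can be detected with startswith afterwards.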

Answer 2

Using the re.findall() function and built-in str methods:

import re

s = '"green lizards" like to sit "in the sun"'
result = [i.replace('"', "").split() if i.startswith('"') else i
          for i in re.findall(r'"[^"]+"|\S+', s)]

print(result)

Output:

[['green', 'lizards'], 'like', 'to', 'sit', ['in', 'the', 'sun']]
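
The question also mentions single quotes, which the pattern above does not cover. One possible extension, as a sketch only: add a second alternative for single-quoted spans. The function name tokenize and the QUOTES constant are hypothetical, and the extra check that a token both starts and ends with the same quote is an assumption made here so that words such as don't are left alone.

```python
import re

QUOTES = "\"'"
# Double-quoted span, or single-quoted span, or a run of non-whitespace characters.
PATTERN = re.compile(r'''"[^"]+"|'[^']+'|\S+''')

def tokenize(query):
    """Return bare words as strings and quoted phrases as nested word lists."""
    tokens = []
    for tok in PATTERN.findall(query):
        is_quoted = len(tok) >= 2 and tok[0] in QUOTES and tok[0] == tok[-1]
        # Strip the surrounding quotes before splitting a quoted phrase into words.
        tokens.append(tok[1:-1].split() if is_quoted else tok)
    return tokens

print(tokenize('''"green lizards" like to sit 'in the sun' and don't move'''))
```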

Answer 3

Another approach (supporting both single and double quotes):

import re

sentence = """"green lizards" like to sit "in the sun" and 'single quotes' remain alone"""

rx = re.compile(r"""(['"])(.*?)\1|\S+""")

tokens = [m.group(2).split() 
            if m.group(2) else m.group(0) 
            for m in rx.finditer(sentence)]
print(tokens)

This yields

[['green', 'lizards'], 'like', 'to', 'sit', ['in', 'the', 'sun'], 'and', ['single', 'quotes'], 'remain', 'alone']

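The idea here is:

(['"]) # capture a single or a double quote
(.*?)  # 0+ characters lazily
\1     # up to the same type of quote previously captured
|      # ...or...
\S+    # one or more non-whitespace characters

In the list comprehension, we check which of the two alternatives matched: group 2 is set only when the quoted alternative matched.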
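
The branch check can also be written out as an explicit loop. This is a sketch rather than the answer's own code: the function name parse_query is hypothetical, and it assumes the pattern closes each quoted span with a backreference \1 to the opening quote. Unlike the comprehension, it tests group(1) against None, so an empty pair of quotes would produce an empty list instead of the literal quote characters.

```python
import re

# (['"]) opening quote, (.*?) lazy content, \1 the same quote again, or a bare word.
rx = re.compile(r"""(['"])(.*?)\1|\S+""")

def parse_query(sentence):
    tokens = []
    for m in rx.finditer(sentence):
        if m.group(1) is not None:
            # Quoted alternative matched: group(2) is the text between the quotes.
            tokens.append(m.group(2).split())
        else:
            # Bare-word alternative matched: groups 1 and 2 are both None.
            tokens.append(m.group(0))
    return tokens

print(parse_query("""say "it's fine" 'a "b" c' now"""))
```

Because the closing quote must be the same character as the opening one, a quote of the other type inside a span is kept as ordinary text.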

Tags: python, regex, quotes, parsing, substring