Regex for getting multiple words after a delimiter

Question

I have been trying to use a PCRE regular expression to split the following string into separate groups:

drop = blah blah blah something keep = bar foo nlah aaaa rename = (a=b d=e) obs=4 where = (foo > 45 and bar == 35)

The groups I would like to create are:

1. drop = blah blah blah something
2. keep = bar foo nlah aaaa
3. rename = (a=b d=e)
4. obs=4
5. where = (foo > 45 and bar == 35)

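I have written a regex using recursion, but for some reason the recursion only partially works when selecting multiple words after drop: it selects only the first 3 words (blah blah blah) and not the 4th one. I have looked through various Stack Overflow questions and have also tried a positive lookahead, but this is the closest I could get, and now I am stuck because I cannot understand what I am doing wrong.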

The same can be seen here: RegEx Demo.

Any help with this, or with understanding what I am doing wrong, is appreciated.

Answer 1

You could use the newer regex module together with DEFINE:

(?(DEFINE)
    (?<key>\w+)
    (?<sep>\s*=\s*)
    (?<value>(?:(?!(?&key)(?&sep))[^()=])+)
    (?<par>\((?:[^()]+|(?&par))+\))
)
(?P<k>(?&key))(?&sep)(?P<v>(?:(?&value)|(?&par)))

See a demo on regex101.com.

In Python this could be:

import regex as re

data = """
drop = blah blah blah something keep = bar foo nlah aaaa rename = (a=b d=e) obs=4 where = (foo > 45 and bar == 35)
"""

rx = re.compile(r'''
(?(DEFINE)
    (?<key>\w+)
    (?<sep>\s*=\s*)
    (?<value>(?:(?!(?&key)(?&sep))[^()=])+)
    (?<par>\((?:[^()]+|(?&par))+\))
)

(?P<k>(?&key))(?&sep)(?P<v>(?:(?&value)|(?&par)))''', re.X)

result = {m.group('k').strip(): m.group('v').strip()
          for m in rx.finditer(data)}

print(result)

This yields:

{'drop': 'blah blah blah something', 'keep': 'bar foo nlah aaaa', 'rename': '(a=b d=e)', 'obs': '4', 'where': '(foo > 45 and bar == 35)'}
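
The recursive par subpattern is what lets a parenthesised value contain further nested parentheses. As a rough illustration of what that recursion accomplishes, here is a minimal sketch in plain Python that finds the matching closing parenthesis by counting depth. The helper name match_balanced and the sample string are made up for this illustration, and this is not how the regex module works internally.

```python
def match_balanced(text, start):
    """Return the index just past the ')' that closes the '(' at text[start].

    Hypothetical helper: roughly mimics what (?&par) matches, by counting
    nesting depth instead of recursing. Returns -1 if text[start] is not '('
    or if the group never closes. Unlike the pattern, it also accepts '()'.
    """
    if start >= len(text) or text[start] != "(":
        return -1
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "(":
            depth += 1          # entering a nested group
        elif text[i] == ")":
            depth -= 1          # leaving a group
            if depth == 0:
                return i + 1    # the outermost group just closed
    return -1                   # ran out of input before the group closed


# Made-up input with one level of nesting inside the where clause.
sample = "where = (foo > 45 and (bar == 35 or baz == 1)) obs=4"
start = sample.index("(")
end = match_balanced(sample, start)
print(sample[start:end])
# => (foo > 45 and (bar == 35 or baz == 1))
```

A non-recursive pattern such as \(.*?\) would stop at the first closing parenthesis in this input, which is why the recursive definition matters once values can nest.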

Answer 2

You may use a branch reset group solution:

(?i)\b(drop|keep|where|rename|obs)\s*=\s*(?|(\w+(?:\s+\w+)*)(?=\s+\w+\s+=|$)|\((.*?)\))

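See the PCRE regex demo.

Details

(?i) - case insensitive mode on
\b - a word boundary
(drop|keep|where|rename|obs) - Group 1: any of the words in the group
\s*=\s* - a = char enclosed with 0+ whitespace chars
(?| - start of a branch reset group:
- (\w+(?:\s+\w+)*) - Group 2: one or more word chars followed by zero or more repetitions of one or more whitespaces and one or more word chars
- (?=\s+\w+\s+=|$) - a positive lookahead that requires, immediately to the right, one or more whitespaces, one or more word chars, one or more whitespaces and =, or the end of the string
- | - or
- \((.*?)\) - (, then Group 2 capturing any zero or more chars other than line break chars, as few as possible, and then )
) - end of the branch reset group.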

See the Python demo:

import regex
pattern = r"(?i)\b(drop|keep|where|rename|obs)\s*=\s*(?|(\w+(?:\s+\w+)*)(?=\s+\w+\s+=|$)|\((.*?)\))"
text = "drop = blah blah blah something keep = bar foo nlah aaaa rename = (a=b d=e) obs=4 where = (foo > 45 and bar == 35)"
print( [x.group() for x in regex.finditer(pattern, text)] )
# => ['drop = blah blah blah something', 'keep = bar foo nlah aaaa', 'rename = (a=b d=e)', 'obs=4', 'where = (foo > 45 and bar == 35)']
print( regex.findall(pattern, text) )
# => [('drop', 'blah blah blah something'), ('keep', 'bar foo nlah aaaa'), ('rename', 'a=b d=e'), ('obs', '4'), ('where', 'foo > 45 and bar == 35')]
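
Python's built-in re module does not support branch reset groups, so the pattern above needs the third-party regex module. If only the standard library is available, a rough equivalent is to give each alternative its own named group and pick whichever one matched. The sketch below assumes that approach; the group names key, words and paren and the function parse_options are made up for illustration.

```python
import re

# Same idea as the pattern above, but with two separately named groups
# instead of a branch reset group, since stdlib `re` has no (?|...).
PATTERN = re.compile(
    r"""
    \b(?P<key>drop|keep|where|rename|obs)   # option name
    \s*=\s*                                 # '=' with optional whitespace
    (?:
        (?P<words>\w+(?:\s+\w+)*)           # bare words ...
        (?=\s+\w+\s+=|$)                    # ... up to the next 'name =' or the end
      |
        \((?P<paren>.*?)\)                  # or a parenthesised value
    )
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_options(text):
    """Return (key, value) pairs; exactly one of the two value groups is set."""
    pairs = []
    for m in PATTERN.finditer(text):
        words = m.group("words")
        value = words if words is not None else m.group("paren")
        pairs.append((m.group("key"), value))
    return pairs


text = "drop = blah blah blah something keep = bar foo nlah aaaa rename = (a=b d=e) obs=4 where = (foo > 45 and bar == 35)"
print(parse_options(text))
# => [('drop', 'blah blah blah something'), ('keep', 'bar foo nlah aaaa'), ('rename', 'a=b d=e'), ('obs', '4'), ('where', 'foo > 45 and bar == 35')]
```

As with the original pattern, the lazy .*? stops at the first closing parenthesis, so this sketch does not handle nested parentheses inside a value.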
