Split String based on multiple Regex matches

Question

First of all, I checked these previous posts, and they did not help me: 1 & 2 &
I have this string (or similar ones) that needs to be processed with a regex:

"Text Table 6-2: Management of children study and actions"

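What I need to do is:

1. Detect the word Table and the word before it (if there is one)
2. Detect the number that follows, which can be in any of these formats: 6 or 6-2 or 66-22 or 66-2
3. Finally, take the rest of the string (in this case: Management of children study and actions)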

After doing so, the return value must look like this:

Return points 1 and 2 as one string and the rest as another string,
e.g. the returned value must look like this: Text Table 6-2, Management of children study and actions

Below is my code:

import re

mystr = "Text Table 6-2:    Management of children study and actions"

if re.match("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr):
    print("True matched")
    parts_of_title = re.search("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr)
    print(parts_of_title)
    print(" ".join(parts_of_title.group().split()[0:3]), parts_of_title.group().split()[-1])

The first part is returned correctly, but the second part is not. I changed the code to use compile, but then the regex behaved differently. The code looks like this:

mystr = "Text Table 6-2:    Management of children study and actions"


if re.match("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr):
    print("True matched")
    parts_of_title = re.compile("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?").split(mystr)
    print(parts_of_title)

Output:

True matched
['', 'Text ', 'Table', '-2', ':\tManagement of children study and actions']

So, based on this, how can I achieve this while keeping the code clean and readable? And why does using compile change the match?
Thanks in advance.

Answer 1

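The result changes because you are calling two different split methods:

In the first part, you call .group().split(), where .group() returns the full match as a string, so this is str.split() splitting the matched text on whitespace.
In the second part, you call re.compile("...").split(), where re.compile returns a regular expression object. Its split() method splits the input string at each match of the pattern and also puts the text of every capture group into the result list.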
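To make the difference concrete, here is a minimal sketch that runs both calls side by side with the pattern from the question. The tab in the sample string is an assumption, chosen to match the output shown in the question.

```python
import re

pattern = r"([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?"
# Assumption: a tab follows the colon, as in the output printed in the question.
mystr = "Text Table 6-2:\tManagement of children study and actions"

# .group() is the matched substring; str.split() then splits it on whitespace.
full_match = re.search(pattern, mystr).group()
words = full_match.split()
print(words)   # ['Text', 'Table', '6-2']

# Pattern.split() cuts the input at each match and inserts every capture group.
pieces = re.compile(pattern).split(mystr)
print(pieces)  # ['', 'Text ', 'Table', '-2', ':\tManagement of children study and actions']
```

The leading empty string is the text before the match, and the '6' disappears because it is consumed by the match without being inside any capture group.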

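In the pattern, this part, [a-zA-Z0-9]+[ ], matches only a single word. If the number should end up in a capture group, note that in this part, [0-9]([-][0-9]+)?, the first (single) digit is currently outside the capture group, which is why only '-2' appears in the split result.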
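As a rough illustration of that point, the sketch below compares the groups captured by the question's pattern with a hypothetical adjusted variant (not from the original post) that puts the whole number into one capture group and allows several digits before the hyphen.

```python
import re

original = r"([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?"
# Hypothetical variant: the full number is one group and may start with several digits.
adjusted = r"([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ]([0-9]+(?:-[0-9]+)?)"

sample = "Text Table 66-22: Management"

# The single [0-9] consumes only the first '6'; the optional group then finds no '-'.
original_groups = re.match(original, sample).groups()
print(original_groups)  # ('Text ', 'Table', None)

# With the number in one group, all of '66-22' is captured.
adjusted_groups = re.match(adjusted, sample).groups()
print(adjusted_groups)  # ('Text ', 'Table', '66-22')
```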

You could write the pattern with 4 capture groups:

^(.*? )?((?:[Ll]ist|[Tt]able|[Ff]igure))\s+(\d+(?:-\d+)?):\s+(.+)

See a regex demo.

import re

pattern = r"^(.*? )?((?:[Ll]ist|[Tt]able|[Ff]igure))\s+(\d+(?:-\d+)?):\s+(.+)"
s = "Text Table 6-2:    Management of children study and actions"
m = re.match(pattern, s)
if m:
    print(m.groups())

Output

('Text ', 'Table', '6-2', 'Management of children study and actions')

If you want points 1 and 2 as one string, you can use 2 capture groups instead.

^((?:.*? )?(?:[Ll]ist|[Tt]able|[Ff]igure)\s+\d+(?:-\d+)?):\s+(.+)

Regex demo

The output will be

('Text Table 6-2', 'Management of children study and actions')
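One possible way to keep the calling code clean is to wrap the 2-group pattern in a small helper. This is only a sketch: the names TITLE_RE and split_title are made up for illustration, and returning None for non-matching input is an assumption about the desired behaviour.

```python
import re

# The 2-group pattern from above, compiled once for reuse.
TITLE_RE = re.compile(
    r"^((?:.*? )?(?:[Ll]ist|[Tt]able|[Ff]igure)\s+\d+(?:-\d+)?):\s+(.+)"
)

def split_title(text):
    """Return (label, rest) for a caption line, or None if it does not match."""
    m = TITLE_RE.match(text)
    return m.groups() if m else None

result = split_title("Text Table 6-2:    Management of children study and actions")
print(", ".join(result))
# Text Table 6-2, Management of children study and actions

# The other number formats from the question, with and without a leading word.
print(split_title("Table 6: Foo"))          # ('Table 6', 'Foo')
print(split_title("Text figure 66-22: X"))  # ('Text figure 66-22', 'X')
print(split_title("No caption here"))       # None
```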

Answer 2

You already have an answer, but I wanted to try your problem as practice, so in case you are interested, here is what I came up with:

((?:[a-zA-Z0-9]+)? ?(?:[Ll]ist|[Tt]able|[Ff]igure)).*?((?:[0-9]+\-[0-9]+)|(?<!-)[0-9]+): (.*)

Here is the link to my test: https://regex101.com/r/7VpPM2/1
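For comparison, this pattern can be tried in Python roughly as follows. It is a sketch, not part of the original answer: since the pattern matches only a single space after the colon, any further whitespace stays in the last group, so the example strips it (the .strip() call is an addition).

```python
import re

pattern = (
    r"((?:[a-zA-Z0-9]+)? ?(?:[Ll]ist|[Tt]able|[Ff]igure))"
    r".*?((?:[0-9]+\-[0-9]+)|(?<!-)[0-9]+): (.*)"
)
s = "Text Table 6-2:    Management of children study and actions"

m = re.search(pattern, s)
label, number, rest = m.groups()
# 'rest' still carries the extra spaces after ': ', so strip them.
rest = rest.strip()
print(f"{label} {number}, {rest}")
# Text Table 6-2, Management of children study and actions
```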

python

regex

string

split