Python Regex: non capturing group is captured

Question

I came up with these two regex patterns

1.

\([0-9]\)\s+([^.!?]+[.!?])

2.

[.!?]\s+([A-Z].*?[.!?])

to match sentences in a string like this:

(1) A first sentence, that always follows a number in parantheses. This is my second sentence. This is my third sentence, (...) .

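Thanks to your answers, I managed to capture the intro sentence that follows the number in parentheses. I also get the second sentence with my second regex.

However, the third sentence is not captured, because the . before it was already consumed by the previous match. My goal is to find the start of these sentences in two ways:

By capturing (1)
By recognizing a period, whitespace, and a following capital letter to get any other sentence.

How do I avoid the match failing for the third and all following sentences?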
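The failure can be reproduced with a short sketch. The sample text below is a shortened, made-up variant of the string above, and the variable names are only illustrative. Because each match consumes the closing punctuation of the sentence it captures, the next search has no leading [.!?] left to start from.

```python
import re

# Shortened sample text (an assumption, not the exact string from the question).
text = "(1) A first sentence. This is my second sentence. This is my third sentence."

# Second pattern from the question: the leading [.!?] is part of the match.
pattern_2 = r"[.!?]\s+([A-Z].*?[.!?])"

# The match for the second sentence ends on its final ".", so that "."
# is no longer available as the starting point for the third sentence.
consumed = re.findall(pattern_2, text)
print(consumed)  # ['This is my second sentence.']
```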

Thanks for your help!

Answer 1

You can use a capturing group with a negated character class [^...]. If you want to match 1 or more digits, you can use [0-9]+

\([0-9]\)\s+([^.!?]+[.!?])

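\([0-9]\) Match a digit between parentheses
\s+ Match 1+ whitespace characters
( Capture group 1
- [^.!?]+[.!?] Match any character except ., ! or ? 1+ times, then match one of those three
) Close the group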


For example

import re

regex = r"\([0-9]\)\s+([^.!?]+[.!?])"
test_str = "(1) This is my first sentence, it has to be captured. This is my second sentence."

print(re.findall(regex, test_str))

Output

['This is my first sentence, it has to be captured.']
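As a small sketch of the [0-9]+ variant mentioned above: the "(12)" input is an invented example, chosen only to show that the single-digit pattern stops matching once the number has two digits.

```python
import re

# Invented input with a two-digit number in the parentheses.
text = "(12) This is my first sentence, it has to be captured. This is my second sentence."

single_digit = re.findall(r"\([0-9]\)\s+([^.!?]+[.!?])", text)
multi_digit = re.findall(r"\([0-9]+\)\s+([^.!?]+[.!?])", text)

print(single_digit)  # []
print(multi_digit)   # ['This is my first sentence, it has to be captured.']
```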

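If you also want to match the other sentences and be able to tell the first sentence apart from the rest, you can use an alternation with a second capturing group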

(?:\([0-9]\)\s+([^.!?]+[.!?])|([A-Z].*?\.)(?: |$))
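A minimal sketch of how the two groups could be told apart in Python. The sample text and the "first"/"other" labels are assumptions made for illustration. Only one of the two groups takes part in any given match, so checking which one is not None identifies the branch that matched.

```python
import re

pattern = re.compile(r"(?:\([0-9]\)\s+([^.!?]+[.!?])|([A-Z].*?\.)(?: |$))")

# Shortened sample text (an assumption).
text = "(1) A first sentence. This is my second sentence. This is my third sentence."

labelled = []
for m in pattern.finditer(text):
    if m.group(1) is not None:
        # First branch: the sentence that follows "(digit)".
        labelled.append(("first", m.group(1)))
    else:
        # Second branch: a sentence that starts with a capital letter.
        labelled.append(("other", m.group(2)))

for label, sentence in labelled:
    print(label, "->", sentence)
```

Note that the second branch only ends a sentence on a period, so sentences ending in ! or ? would need [.!?] there instead.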


Answer 2

You can use your existing regex; just put a group around the sentence part, (.*?[.!?]), and then take group 1 from the result of re.search:

import re

para = '(1) This is my first sentence, it has to be captured. This is my second sentence.'
print(re.search(r'\([0-9]\)\s+(.*?[.!?])', para).group(1))

Output:

This is my first sentence, it has to be captured.
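One thing to watch for: re.search returns None when nothing matches, so calling .group(1) directly raises AttributeError on such input. A hedged sketch with a guard follows; the helper name first_sentence and the extra sample strings are made up for illustration. It also shows that re.match only succeeds when the "(1)" sits at the very start of the string.

```python
import re

pattern = re.compile(r'\([0-9]\)\s+(.*?[.!?])')

def first_sentence(text):
    """Return the sentence after '(digit)', or None if there is no match."""
    m = pattern.search(text)
    return m.group(1) if m else None

para = '(1) This is my first sentence, it has to be captured. This is my second sentence.'
print(first_sentence(para))               # This is my first sentence, it has to be captured.
print(first_sentence('No number here.'))  # None

# match() is anchored at position 0, search() scans the whole string.
prefixed = 'Intro: (1) Hello there.'
print(pattern.match(prefixed))            # None
print(pattern.search(prefixed).group(1))  # Hello there.
```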

Answer 3

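You have several options for doing this. The first is a lookbehind: replace the ':' of your non-capturing group with '<=', turning (?:...) into (?<=...). Unfortunately, lookbehind in Python's re module does not support variable-length patterns, so only a single whitespace character (\s rather than \s+) is allowed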

import re

ss='(1) This is my first sentence, it has to be captured. This is my second sentence.'

re.search(r'(?<=\([0-9]\)\s).*?[.!?]', ss).group(0)

Output:

'This is my first sentence, it has to be captured.'
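To illustrate the fixed-width restriction, here is a small sketch. The three-space input is an invented example. Compiling a lookbehind that contains \s+ raises re.error, and the one-whitespace lookbehind leaves any additional spaces inside the match.

```python
import re

# Invented input with three spaces after "(1)".
spaced = '(1)   This is my first sentence, it has to be captured.'

# A variable-length lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r'(?<=\([0-9]\)\s+).*?[.!?]')
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)
print(lookbehind_error)

# The fixed-width version compiles, but it accounts for one whitespace
# character only, so the remaining two spaces end up in the result.
fixed = re.search(r'(?<=\([0-9]\)\s).*?[.!?]', spaced).group(0)
print(repr(fixed))
```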

You can also search with a group:

re.search(r'\([0-9]\)\s+(.*?[.!?])', ss).group(1)

Output:

'This is my first sentence, it has to be captured.'

This allows variable-length patterns.

Both options make minimal modifications to your original pattern.
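For the third and later sentences, one possible approach (a sketch, not something tested against your real data) is to apply the same lookbehind idea to the second pattern. A single-character lookbehind (?<=[.!?]) is fixed-width, and since it only checks the punctuation without consuming it, the next match can still start there. The sample text is a shortened assumption.

```python
import re

# Shortened sample text (an assumption).
text = "(1) A first sentence. This is my second sentence. This is my third sentence."

# The sentence-ending character is asserted, not consumed.
lookbehind_pattern = r"(?<=[.!?])\s+([A-Z].*?[.!?])"

following = re.findall(lookbehind_pattern, text)
print(following)  # ['This is my second sentence.', 'This is my third sentence.']
```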
