How to find a sentence containing a phrase in text using python re?

Question

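I have some text made up of sentences, some of which are questions. I'm trying to create a regular expression that extracts only the questions containing a specific phrase, namely 'NSF':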

import re
s = "This is a string. Is this a question? This isn't a question about NSF. Is this one about NSF? This one is a question about NSF but is it longer?"

Ideally, re.findall would return:

['Is this one about NSF?','This one is a question about NSF but is it longer?']

But my best attempt so far is:

re.findall('([\.\?].*?NSF.*\?)+?',s)
[". Is this a question? This isn't a question about NSF. Is this one about NSF? This one is a question about NSF but is it longer?"]

I know I need to do something non-greedy, but I'm not sure where I'm going wrong.

Answer 1

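Disclaimer: this answer is not a generic solution for splitting text into interrogative sentences; it only shows how the string provided by the OP can be matched with a regular expression. The best solution is to tokenize the text into sentences with nltk and parse the sentences (see this thread).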

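The regex you may want to use for strings like the one you posted works in three steps: match all characters that are not final punctuation, then match the substring you want to appear in the sentence, then again match characters that are not final punctuation, up to the closing question mark. To exclude specific characters at a given position, use a negated character class.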

\s*([^!.?]*?NSF[^!.?]*?[?])

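See the regex demo.

Details:

\s* - 0+ whitespace characters
([^!.?]*?NSF[^!.?]*?[?]) - capturing group 1
- [^!.?]*? - 0+ characters other than ., ! and ?, as few as possible
- NSF - the value that must be present, the character sequence NSF
- [^!.?]*? - same as above
- [?] - a literal ? (can be replaced with \?)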

python

regex

non-greedy