Find all sentences containing specific words

I have a string made up of sentences, and I want to find all sentences that contain at least one of the specific keywords, i.e. keyword1 or keyword2:

import re

s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "

pattern = re.compile(r"([A-Z][^\.!?].*(keyword1)|(keyword2).*[\.!?])\s")
for match in pattern.findall(s):
    print(match)

Output:

('This is a sentence which contains keyword1', 'keyword1', '')
('keyword2 is inside this sentence. ', '', 'keyword2')

Expected output:

('This is a sentence which contains keyword1', 'keyword1', '')
('And keyword2 is inside this sentence. ', '', 'keyword2')

As you can see, the second match does not contain the whole sentence in the first group. What am I missing here?

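In the original pattern, the | splits everything inside group 1 into two alternatives, so keyword2 can only match at the very start of its branch. You can use a negated character class so the match never crosses ., ! or ?, and put the keyword alternatives together in one group of their own so the alternation applies only to the keywords.

re.findall then returns the capture group values: group 1 holds the whole match, and groups 2, 3, etc. hold whichever keyword matched (the groups that did not take part are returned as empty strings).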

([A-Z][^.!?]*(?:(keyword1)|(keyword2))[^.!?]*[.!?])\s

Explanation

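  • ( Capture group 1
    • [A-Z][^.!?]* Match an uppercase character A-Z, then optionally any characters other than ., ! or ?
    • (?:(keyword1)|(keyword2)) Capture one of the keywords in its own group
    • [^.!?]*[.!?] Match any characters other than ., ! or ?, then match one of ., ! or ?
  • ) Close group 1
  • \s Match a whitespace character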
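
To illustrate why grouping the keywords matters, here is a minimal sketch using simplified patterns (the names `ungrouped` and `grouped` and the single-sentence input are illustrative assumptions, not part of the question). Because | has the lowest precedence, an ungrouped alternation splits the entire pattern in two:

```python
import re

text = "And keyword2 is inside this sentence."

# Ungrouped: read as "[A-Z].*keyword1" OR "keyword2.*[.!?]".
ungrouped = re.compile(r"[A-Z].*keyword1|keyword2.*[.!?]")

# Grouped: "|" only chooses between the two keywords.
grouped = re.compile(r"[A-Z].*(?:keyword1|keyword2).*[.!?]")

ungrouped_match = ungrouped.search(text).group(0)
grouped_match = grouped.search(text).group(0)

print(ungrouped_match)  # the match starts at "keyword2", so "And " is lost
print(grouped_match)    # the match covers the whole sentence
```

The second branch of the ungrouped pattern has no `[A-Z]` prefix at all, which is why the beginning of the sentence is dropped.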

See a regex demo and a Python demo.

Example

import re

s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "

pattern = re.compile(r"([A-Z][^.!?]*(?:(keyword1)|(keyword2))[^.!?]*[.!?])\s")
for match in pattern.findall(s):
    print(match)

Output

('This is a sentence which contains keyword1.', 'keyword1', '')
('And keyword2 is inside this sentence.', '', 'keyword2')
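
If the empty strings in the tuples are unwanted, one possible variation (a sketch, not part of the answer above) is to capture both keywords in a single group, so each result is just a (sentence, keyword) pair. The `results` name is an assumption for illustration:

```python
import re

s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "

# One capture group for the keyword instead of one group per keyword,
# so findall never has an unused group to report as ''.
pattern = re.compile(r"([A-Z][^.!?]*(keyword1|keyword2)[^.!?]*[.!?])\s")

results = pattern.findall(s)
for sentence, keyword in results:
    print((sentence, keyword))
```

With this input the loop prints the two sentences paired with 'keyword1' and 'keyword2' respectively. If a sentence contains several keywords, only one of them is reported for that sentence.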

You can try the following regex:

[.?!]*\s*(.*(keyword1)[^.?!]*[.?!]|.*(keyword2)[^.?!]*[.?!])

Code:

import re

s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "

pattern = re.compile(r"[.?!]*\s*(.*(keyword1)[^.?!]*[.?!]|.*(keyword2)[^.?!]*[.?!])")
for match in pattern.findall(s):
    print(match)

Output:

('This is a sentence which contains keyword1.', 'keyword1', '')
('And keyword2 is inside this sentence.', '', 'keyword2')
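
One caveat worth checking before relying on this pattern: the leading .* is not limited to a single sentence, so it may also pull in an earlier sentence that contains no keyword. The sketch below uses a hypothetical input (not from the question) and compares it with the negated-character-class pattern; the names `greedy` and `bounded` are illustrative:

```python
import re

# Hypothetical input: the first sentence contains no keyword.
s = "Nothing to see here. This has keyword1. "

# Pattern from this answer: ".*" can run across sentence boundaries.
greedy = re.compile(r"[.?!]*\s*(.*(keyword1)[^.?!]*[.?!]|.*(keyword2)[^.?!]*[.?!])")

# Pattern using [^.!?]*, which cannot cross a sentence terminator.
bounded = re.compile(r"([A-Z][^.!?]*(?:(keyword1)|(keyword2))[^.!?]*[.!?])\s")

greedy_sentences = [groups[0] for groups in greedy.findall(s)]
bounded_sentences = [groups[0] for groups in bounded.findall(s)]

print(greedy_sentences)   # includes the keyword-free first sentence
print(bounded_sentences)  # only the sentence with keyword1
```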