Find all sentences containing specific words
I have a string made up of sentences, and I want to find every sentence that contains at least one of a set of specific keywords, i.e. keyword1 or keyword2:
import re
s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "
pattern = re.compile(r"([A-Z][^\.!?].*(keyword1)|(keyword2).*[\.!?])\s")
for match in pattern.findall(s):
    print(match)
Output:
('This is a sentence which contains keyword1', 'keyword1', '')
('keyword2 is inside this sentence. ', '', 'keyword2')
Expected output:
('This is a sentence which contains keyword1', 'keyword1', '')
('And keyword2 is inside this sentence. ', '', 'keyword2')
As you can see, the second match does not contain the whole sentence in the first group. What am I missing here?
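You can use a negated character class so that the match cannot run across ., ! or ?, and wrap the keyword alternatives in a single non-capturing group so that | applies only to the keywords instead of splitting the whole pattern in two.
re.findall then returns the capture group values: group 1 is the whole sentence, and groups 2, 3, etc. hold the keywords. The group for a keyword that did not take part in the match is returned as an empty string.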
([A-Z][^.!?]*(?:(keyword1)|(keyword2))[^.!?]*[.!?])\s
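Explanation
(    opens capture group 1
[A-Z][^.!?]*    matches an uppercase character A-Z, then optionally any characters other than ., ! or ?
(?:(keyword1)|(keyword2))    captures whichever keyword matched in its own group
[^.!?]*[.!?]    matches any characters other than ., ! or ?, then one of ., ! or ?
)    closes group 1
\s    matches a whitespace character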
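The breakdown above can also be kept next to the pattern itself. The following is a minimal sketch that writes the same regex with re.VERBOSE so each part carries its own comment. The name SENTENCE_WITH_KEYWORD is only illustrative and does not appear in the original answer.

```python
import re

# Same pattern as in the explanation, spread over several lines.
# With re.VERBOSE, whitespace outside character classes is ignored
# and everything after an unescaped "#" is a comment.
SENTENCE_WITH_KEYWORD = re.compile(
    r"""
    (                               # group 1: the whole sentence
        [A-Z][^.!?]*                # uppercase letter, then anything except . ! ?
        (?:(keyword1)|(keyword2))   # group 2 or group 3: the keyword that matched
        [^.!?]*[.!?]                # rest of the sentence plus closing punctuation
    )
    \s                              # one whitespace character after the sentence
    """,
    re.VERBOSE,
)

s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "
matches = SENTENCE_WITH_KEYWORD.findall(s)
for match in matches:
    print(match)
```

The verbose form should behave the same as the one-line pattern; it only changes how the pattern is laid out in the source.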
Example
import re
s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "
pattern = re.compile(r"([A-Z][^.!?]*(?:(keyword1)|(keyword2))[^.!?]*[.!?])\s")
for match in pattern.findall(s):
    print(match)
Output
('This is a sentence which contains keyword1.', 'keyword1', '')
('And keyword2 is inside this sentence.', '', 'keyword2')
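If the keyword list is not fixed, the same idea can be wrapped in a small helper. This is a sketch under a few assumptions, not part of the original answer: the function name sentences_with_keywords is hypothetical, the keyword list is assumed to be non-empty, all keywords share one capture group (so no empty strings appear in the result), and the trailing \s is swapped for a lookahead (?=\s|$) so that a final sentence without trailing whitespace still matches.

```python
import re

def sentences_with_keywords(text, keywords):
    """Return (sentence, keyword) pairs for sentences containing any keyword."""
    # re.escape keeps keywords that contain regex metacharacters literal.
    alternatives = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(
        r"[A-Z][^.!?]*(" + alternatives + r")[^.!?]*[.!?](?=\s|$)"
    )
    # group(0) is the whole sentence, group(1) the keyword that matched.
    return [(m.group(0), m.group(1)) for m in pattern.finditer(text)]

s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "
result = sentences_with_keywords(s, ["keyword1", "keyword2"])
for sentence, keyword in result:
    print(keyword, "->", sentence)
```

Note that keywords are matched as substrings here, so keyword1 would also be found inside keyword10. Wrapping the alternatives in \b...\b is one way to require whole words, if that is what you need.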
You can try the following regex:
[.?!]*\s*(.*(keyword1)[^.?!]*[.?!]|.*(keyword2)[^.?!]*[.?!])
Code:
import re
s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "
pattern = re.compile(r"[.?!]*\s*(.*(keyword1)[^.?!]*[.?!]|.*(keyword2)[^.?!]*[.?!])")
for match in pattern.findall(s):
    print(match)
Output:
('This is a sentence which contains keyword1.', 'keyword1', '')
('And keyword2 is inside this sentence.', '', 'keyword2')
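One thing worth checking before relying on this pattern: the leading .* is not limited to a single sentence, so a sentence without a keyword that precedes one with a keyword can end up in the same match. The sketch below compares the two patterns from this page on a different input string; the variable names and the sample string are made up for the comparison.

```python
import re

# First sentence has no keyword, second one does.
s = "No match in this one. But keyword1 appears here. "

# Pattern using .* before the keyword.
dot_star = re.compile(r"[.?!]*\s*(.*(keyword1)[^.?!]*[.?!]|.*(keyword2)[^.?!]*[.?!])")
# Pattern using negated character classes.
negated = re.compile(r"([A-Z][^.!?]*(?:(keyword1)|(keyword2))[^.!?]*[.!?])\s")

dot_star_result = dot_star.findall(s)
negated_result = negated.findall(s)

print(dot_star_result)
# [('No match in this one. But keyword1 appears here.', 'keyword1', '')]
print(negated_result)
# [('But keyword1 appears here.', 'keyword1', '')]
```

For the sample string in the question both patterns give the same result, because every sentence there contains a keyword. The difference only shows up when some sentences have none.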