Regex - Python - Capture everything between a word

Question

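Is it possible to capture specific sentences that contain a keyword (a time)? Example:

'I want to capture this part (time) and this part. Not this sentence, because it does not contain our keyword. Also this sentence, because it contains (time)'

- Note 1: The time is not actually in parentheses; it stands for a time of day such as 12:45, 10:45, etc.

- Note 2: I am looking for a regex that captures the whole sentence whenever the keyword is present. If findall does not find the keyword in a sentence, it should move on to the next sentence.

- Note 3: In the end we should have all of the sentences that contain the keyword.

I have added some additional information: I tested the code you gave me against the text below.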

text = "He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14. The police found his body 10 minutes after the explosion"

capture_1 = re.findall(r"(?:\.|\A)(.*\d*:\d*.*)\.", text, flags=re.DOTALL)
capture_2 = re.findall(r'(\..*)(\d*:\d*)(.*) ', text, flags=re.DOTALL)

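capture_1 gives me this:

['He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14']

capture_2 gives me this:

[('. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00', ':14', '. The police found his body 10 minutes after the')]

However, I want the following sentences: ['The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. Time of death was 00:14']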

Answer 1

(?:\.|\A)([^.]*\d*:\d*[^.]*)\.

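This captures every string that sits between two periods, or between the start of the string and a period (so the first sentence can be captured too). Because the pattern uses the negated class [^.] rather than ., it already matches across newlines; the re.DOTALL flag only matters for a pattern that uses . itself, such as the .* in your attempts.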

For example:

re.findall(r"(?:\.|\A)([^.]*\d*:\d*[^.]*)\.", text, flags=re.DOTALL)

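Note that a single findall call returns the matching sentences in one pass, so there is no need to check sentence by sentence. One caveat: each match consumes its closing period, so a keyword sentence that directly follows another matched sentence is skipped (on the sample text, the 23:58 sentence).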
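Because the pattern above matches the period on both sides of a sentence, the closing period of one match is no longer available as the opening period of the next. The sketch below compares it with a variant that is not part of the original answer and should be read as an assumption: the leading \. becomes a lookbehind, (?<=\.), which checks for the period without consuming it, and \d+:\d+ replaces \d*:\d* so that a bare colon does not count as a time.

```python
import re

text = ("He was there. The terrorist destroyed the building at 23:45 with a "
        "remote detonation device. He escaped at 23:58 from the balcony of "
        "the terrace. He did not survived. Time of death was 00:14. The "
        "police found his body 10 minutes after the explosion")

# Pattern from the answer: the leading \. is consumed as part of each match.
consuming = r"(?:\.|\A)([^.]*\d*:\d*[^.]*)\."
# Variant (an assumption, not from the answer): the lookbehind only checks
# that a period precedes the sentence, so adjacent sentences can both match.
lookbehind = r"(?:(?<=\.)|\A)([^.]*\d+:\d+[^.]*)\."

consuming_hits = [s.strip() for s in re.findall(consuming, text)]
lookbehind_hits = [s.strip() for s in re.findall(lookbehind, text)]

print(len(consuming_hits))   # number of sentences found by the answer's pattern
print(len(lookbehind_hits))  # number of sentences found by the lookbehind variant
```

On the sample text the first pattern misses the sentence that immediately follows another match, while the lookbehind variant finds all three.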

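Edit:

I have changed the regex above so that it captures each sentence containing your keyword, unless the keyword is immediately next to a period.
If I may, I would suggest another technique that uses a list comprehension: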

[s for s in re.split(r'\.', text) if re.search(r'\d*:\d*', s)]

For your example this returns:

[' The terrorist destroyed the building at 23:45 with a remote detonation device',
 ' He escaped at 23:58 from the balcony of the terrace',
 ' Time of death was 00:14']
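One detail of the filter is worth a second look: \d*:\d* also matches a lone colon, since both digit runs may be empty, so a sentence such as "Note: he left" would pass. Below is a minimal sketch of the same split-and-filter idea with a stricter time pattern. The TIME pattern (one or two digits, a colon, two digits) and the helper name sentences_with_time are illustrative assumptions, not part of the answer.

```python
import re

# Assumed time format: one or two digits, a colon, exactly two digits.
TIME = re.compile(r"\b\d{1,2}:\d{2}\b")

def sentences_with_time(text):
    """Split on periods and keep the pieces that contain a time."""
    pieces = (s.strip() for s in text.split("."))
    return [s for s in pieces if TIME.search(s)]

sample = "Note: he left. He came back at 9:05. Nobody saw him."
print(sentences_with_time(sample))
```

Stripping each piece also removes the leading space that a plain split on "." leaves behind.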

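Note that this still runs into problems if your text contains a . that does not end a sentence. For example, "Mr. Magoo ate beans and toast at 12:34" would be captured as " Magoo ate beans and toast at 12:34", losing the "Mr".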

If you run into this problem, I suggest asking about it as a separate question.
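For readers who do hit the abbreviation problem, one possible direction is sketched here. It is only a sketch, with a hypothetical and far from complete abbreviation list: split on a period that is followed by whitespace, and use negative lookbehinds to skip periods that come right after a known abbreviation. The names SENTENCE_BREAK and time_sentences are made up for illustration.

```python
import re

# Hypothetical, deliberately tiny abbreviation list. Python lookbehinds must be
# fixed-width, so each abbreviation gets its own (?<!...) assertion.
SENTENCE_BREAK = re.compile(r"(?<!Mr)(?<!Mrs)(?<!Dr)\.\s+")
TIME = re.compile(r"\d{1,2}:\d{2}")

def time_sentences(text):
    """Split at sentence breaks, then keep the sentences that contain a time."""
    return [s for s in SENTENCE_BREAK.split(text) if TIME.search(s)]

print(time_sentences("Mr. Magoo ate beans and toast at 12:34. He left."))
```

A real text will contain abbreviations that such a list misses, which is why a dedicated sentence tokenizer is usually the safer choice for anything beyond simple input.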

Answer 2

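Well, you can achieve this easily with a regular expression that uses a positive lookbehind and a positive lookahead.

Below is an example that uses such a regex.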

import re


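def replace_keyword(start, end, data):
    # An empty start/end means "from the beginning"/"to the end" of the string.
    start_rx = start if start else "^"
    end_rx = end if end else "$"

    # The lookbehind and lookahead keep the delimiters out of the match itself.
    rx = "(?<={0}).*(?={1})".format(start_rx, end_rx)
    match = re.search(rx, data, re.DOTALL | re.MULTILINE)
    if match:
        # Re-append the literal end keyword (nothing when end was empty).
        return match.group() + end
    else:
        return data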


data = "He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14. The police found his body 10 minutes after the explosion"

# An empty start string means: start searching from the beginning of the string.
start = ""

# An empty end string would mean: search until the end of the string.
end = "00:14"

data = replace_keyword(start, end, data)

print(data)

After running the code above, data will contain the text

He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14

Hope this does what you expect.
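One thing to watch with the function in this answer: start and end are pasted into the pattern unescaped, so a keyword that contains regex metacharacters, such as the literal "(time)" from the question, changes the meaning of the pattern. A hedged sketch using re.escape follows. The helper name capture_between is hypothetical, and unlike the answer's function it returns only the text between the two keywords (or None when nothing matches).

```python
import re

def capture_between(start, end, data):
    """Return the text between two literal keywords, or None if not found.

    An empty start means "from the beginning"; an empty end means
    "to the end" of the string.
    """
    # re.escape makes the keywords match literally; an escaped literal is
    # fixed-width, which Python requires inside a lookbehind.
    before = "(?<={0})".format(re.escape(start)) if start else r"\A"
    after = "(?={0})".format(re.escape(end)) if end else r"\Z"
    match = re.search(before + "(.*)" + after, data, re.DOTALL)
    return match.group(1) if match else None

print(capture_between("(", ")", "keep (this part) only"))
```

Because .* is greedy, the match runs up to the last occurrence of the end keyword; switching to .*? would stop at the first one instead.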

Answer 3

UPDATE 2: I just found a pattern that works. Hope it helps:

(?:^|\s+)([^.!?]*(?:\d\d:\d\d)[^.!?]*[.!?])

Explanation:

(?:^|\s+)       Non-capturing group,
                match start of sentence, or 1 or more spaces
(               capturing group starts
[^.!?]*         0 or more times of characters except . ! or ?
(?:\d\d:\d\d)   Non-capturing group,
                match dd:dd time format
[^.!?]*         0 or more times of characters except . ! or ?
[.!?]           sentence ends with . ! or ?
)               capturing group ends

import re
text = "He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14. The police found his body 10 minutes after the explosion"
print(' '.join(re.findall(r'(?:^|\s+)([^.!?]*(?:\d\d:\d\d)[^.!?]*[.!?])', text)))

Output:

The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. Time of death was 00:14.
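The part-by-part explanation in this answer can also live inside the pattern itself. The sketch below assumes nothing beyond the standard re module: with re.VERBOSE, whitespace and # comments inside the pattern are ignored, so each piece carries its own note. The name SENTENCE_WITH_TIME and the shorter sample text are illustrative choices, not from the answer.

```python
import re

# Same pattern as in the answer, written with re.VERBOSE so that each part
# carries its own comment. Whitespace inside the pattern is ignored.
SENTENCE_WITH_TIME = re.compile(r"""
    (?:^|\s+)      # start of the string, or whitespace between sentences
    (              # capturing group: the sentence itself
      [^.!?]*      # any run of characters other than . ! ?
      \d\d:\d\d    # a time in dd:dd format
      [^.!?]*      # the rest of the sentence
      [.!?]        # the closing . ! or ?
    )
""", re.VERBOSE)

sample = ("He was there. He escaped at 23:58 from the balcony! "
          "Was he seen? Time of death was 00:14.")
print(SENTENCE_WITH_TIME.findall(sample))
```

Compiling the pattern once also avoids repeating the long string wherever it is used.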


regex

keyword

findall

python-2.7
