Regex - Python - Capture everything between a word

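Is it possible to capture only the sentences that contain a keyword (a time)? Example:

`I want to capture this part (time) and this part. Not this sentence, because it does not contain our keyword. Also this sentence, because it contains (time)`

- Note 1: The time is not actually in parentheses; it stands for a clock time such as 12:45, 10:45, etc.

- Note 2: I am looking for a regular expression that captures the whole sentence whenever the keyword is present. If findall does not find the keyword in a sentence, it should move on to the next sentence.

- Note 3: In the end we get all the sentences that contain the keyword, joined together.

I have added some extra information: the code you gave me and the text I tested it on.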

text = "He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14. The police found his body 10 minutes after the explosion"

capture_1 = re.findall(r"(?:\.|\A)(.*\d*:\d*.*)\.", text, flags=re.DOTALL)
capture_2 = re.findall(r'(\..*)(\d*:\d*)(.*) ', text, flags=re.DOTALL)

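capture_1 gives me this:

['He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14']

capture_2 gives me this:

[('. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00', ':14', '. The police found his body 10 minutes after the')]

However, I want the following sentences: [('. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. Time of death was 00:14')]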

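(?:\.|\A)([^.]*\d*:\d*[^.]*)(?=\.)

This captures every string that lies between two periods, or between the start of the string and a period (so the first sentence can be captured too), provided it contains a colon with optional digits around it. The closing period is only asserted with the lookahead (?=\.) rather than consumed, so that it can also open the next sentence; otherwise findall would skip a keyword sentence that directly follows another one. Because the pattern uses [^.] rather than ., newlines inside a sentence are matched even without the re.DOTALL flag; the flag is kept below only for consistency with the question and has no effect here.

For example:

re.findall(r"(?:\.|\A)([^.]*\d*:\d*[^.]*)(?=\.)", text, flags=re.DOTALL)

Note that this gets all the sentences containing the keyword in one pass, so there is no need to check sentence by sentence.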
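Because findall returns non-overlapping matches, it matters whether the closing period is consumed when two keyword sentences are adjacent. The following is a minimal sketch that compares both variants on the sample text from the question; the variable names `consuming` and `lookahead` are mine and not part of the original answer.

```python
import re

text = ("He was there. The terrorist destroyed the building at 23:45 with a "
        "remote detonation device. He escaped at 23:58 from the balcony of "
        "the terrace. He did not survived. Time of death was 00:14. The "
        "police found his body 10 minutes after the explosion")

# Variant A: the closing period is consumed by the match.
consuming = re.findall(r"(?:\.|\A)([^.]*\d*:\d*[^.]*)\.", text)

# Variant B: the closing period is only asserted with a lookahead,
# so it is still available to open the next sentence.
lookahead = re.findall(r"(?:\.|\A)([^.]*\d*:\d*[^.]*)(?=\.)", text)

print(len(consuming))  # 2: the 23:58 sentence is skipped
print(len(lookahead))  # 3
```

In variant A the period after "device" is part of the first match, so the search resumes after it and the sentence with 23:58 no longer has a leading period to start from.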

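Edit:

I have changed the regex above so that it captures every sentence containing your keyword, unless the keyword is directly next to a period.
If I may, here is another technique, using a list comprehension: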

[s for s in re.split(r'\.', text) if re.search(r'\d*:\d*', s)]

For your example this returns:

[' The terrorist destroyed the building at 23:45 with a remote detonation device',
 ' He escaped at 23:58 from the balcony of the terrace',
 ' Time of death was 00:14']
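The split keeps the space that follows each period, and \d*:\d* matches any colon, even one with no digits around it. A possible variation is sketched below; it assumes a time always looks like two digits, a colon and two digits, and it joins the sentences as Note 3 of the question asks. The names `time_rx`, `sentences` and `summary` are mine.

```python
import re

text = ("He was there. The terrorist destroyed the building at 23:45 with a "
        "remote detonation device. He escaped at 23:58 from the balcony of "
        "the terrace. He did not survived. Time of death was 00:14. The "
        "police found his body 10 minutes after the explosion")

# Assumption: a time is always written as two digits, a colon, two digits.
time_rx = re.compile(r"\d\d:\d\d")

# Split on periods, keep only the pieces that contain a time,
# and strip the whitespace left over from the split.
sentences = [s.strip() for s in re.split(r"\.", text) if time_rx.search(s)]

# Join the kept sentences back into one string.
summary = ". ".join(sentences) + "."
print(summary)
```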

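Note that if your text contains a . that does not end a sentence, this will still run into problems. For example, "Mr. Magoo ate beans and toast at 12:34" will be captured as " Magoo ate beans and toast at 12:34", missing the "Mr." part.

If you run into this problem, I suggest asking about it as a separate question.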
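To make the abbreviation problem concrete, here is a small sketch that reproduces it and shows one possible workaround. The workaround is my assumption, not part of the original answer: it refuses to split after a short, hypothetical list of abbreviations (Mr, Mrs, Dr) and will not cover every case.

```python
import re

text = "Mr. Magoo ate beans and toast at 12:34. He left soon after."

# Plain split: the period in "Mr." is treated as a sentence boundary.
naive = [s for s in re.split(r"\.", text) if re.search(r"\d\d:\d\d", s)]
print(naive)  # [' Magoo ate beans and toast at 12:34']

# Hypothetical workaround: do not split after a few known abbreviations.
# Python lookbehinds must be fixed-width, so one is chained per abbreviation.
boundary = r"(?<!Mr)(?<!Mrs)(?<!Dr)\."
guarded = [s.strip() for s in re.split(boundary, text)
           if re.search(r"\d\d:\d\d", s)]
print(guarded)  # ['Mr. Magoo ate beans and toast at 12:34']
```

A hand-written list like this is fragile. For real text, a dedicated sentence tokenizer is probably the more robust choice.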

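Well, you can easily achieve this with a regular expression that uses positive lookbehind and lookahead.

Below is an example that uses such a regex.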

import re


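def replace_keyword(start, end, data):
    # An empty start anchors the match at the beginning of the string;
    # an empty end anchors it at the end. Other keywords are escaped so
    # that they are matched literally.
    start_rx = re.escape(start) if start else "^"
    end_rx = re.escape(end) if end else "$"

    rx = "(?<={0}).*(?={1})".format(start_rx, end_rx)
    match = re.search(rx, data, re.DOTALL | re.MULTILINE)
    if match:
        # The lookahead does not consume the end keyword, so add it back.
        return match.group() + end
    else:
        return data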


data = "He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14. The police found his body 10 minutes after the explosion"

# An empty start string means: search from the beginning of the string.
start = ""

# An empty end string would mean: search until the end of the string.
end = "00:14"

data = replace_keyword(start, end, data)

print(data)

After running the code above, data will contain the text

He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14

I hope this does what you expect.
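Lookbehind and lookahead check the surrounding text without making it part of the match. A minimal sketch of that behaviour on a single sentence from the sample text follows; the `start` and `end` keywords are hypothetical values chosen only for this illustration.

```python
import re

data = "He escaped at 23:58 from the balcony of the terrace."

# Hypothetical start and end keywords, chosen only for this illustration.
start, end = "escaped at ", " from"

# re.escape makes sure the keywords are matched literally.
rx = "(?<={0}).*(?={1})".format(re.escape(start), re.escape(end))
match = re.search(rx, data)

print(match.group())  # 23:58
print(match.span())   # covers only the text between the two keywords
```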

UPDATE 2: I just found a pattern. I hope it helps:

(?:^|\s+)([^.!?]*(?:\d\d:\d\d)[^.!?]*[.!?])

Explanation:

(?:^|\s+)       Non-capturing group:
                start of the string, or one or more whitespace characters
(               Capturing group starts
[^.!?]*         Zero or more characters other than . ! or ?
(?:\d\d:\d\d)   Non-capturing group:
                a time in dd:dd format
[^.!?]*         Zero or more characters other than . ! or ?
[.!?]           The sentence ends with . ! or ?
)               Capturing group ends
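The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is only a sketch of that option; the name `sentence_with_time` is mine, and the inner non-capturing group around the time is dropped because it does not change what is matched.

```python
import re

sentence_with_time = re.compile(r"""
    (?:^|\s+)        # start of the string, or whitespace before the sentence
    (                # capture the sentence itself
      [^.!?]*        # text before the time, without sentence punctuation
      \d\d:\d\d      # the keyword, a time such as 23:45
      [^.!?]*        # text after the time
      [.!?]          # closing punctuation
    )
""", re.VERBOSE)

text = ("He was there. The terrorist destroyed the building at 23:45 with a "
        "remote detonation device. He escaped at 23:58 from the balcony of "
        "the terrace. He did not survived. Time of death was 00:14. The "
        "police found his body 10 minutes after the explosion")

matches = sentence_with_time.findall(text)
print(" ".join(matches))
```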

import re
text = "He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14. The police found his body 10 minutes after the explosion"
print(' '.join(re.findall(r'(?:^|\s+)([^.!?]*(?:\d\d:\d\d)[^.!?]*[.!?])', text)))

Output:

The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. Time of death was 00:14.