Regex - Python - Capture everything between a word
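Is it possible to capture only the sentences that contain a keyword (a time)? Example:
'I want to capture this part (time) and this part. Not this sentence, because it does not contain our keyword. And also this sentence, because it contains the (time)'
- Note 1: The time is not actually in parentheses; it stands for a clock time such as 12:45, 10:45, etc.
- Note 2: I am looking for a regex that captures the whole sentence when the keyword is present. If findall does not find the keyword in a sentence, it should move on to the next sentence.
- Note 3: In the end we should have all the sentences that contain the keyword.
I have added some more information: here is the code you gave me, tested against my text.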
import re
text = "He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14. The police found his body 10 minutes after the explosion"
capture_1 = re.findall(r"(?:\.|\A)(.*\d*:\d*.*)\.", text, flags=re.DOTALL)
capture_2 = re.findall(r'(\..*)(\d*:\d*)(.*) ', text, flags=re.DOTALL)
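capture_1 gives me this:
['He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14']
capture_2 gives me this:
[('. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00', ':14', '. The police found his body 10 minutes after the')]
However, I want the following sentences:
[('. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. Time of death was 00:14')]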
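(?:\.|\A)([^.]*\d*:\d*[^.]*)(?=\.)
This captures every string that lies between two periods, or between the start of the string and a period (so the first sentence can be captured too), and that contains the keyword. The closing period is matched with a lookahead, (?=\.), rather than consumed, so that it can also serve as the opening period of the next sentence; if it were consumed, a matching sentence that directly follows another matching sentence would be skipped. The re.DOTALL flag only changes what . matches; the negated class [^.] already matches newlines, so the flag is optional here.
For example:
re.findall(r"(?:\.|\A)([^.]*\d*:\d*[^.]*)(?=\.)", text, flags=re.DOTALL)
Note that this gets all the sentences containing the keyword in one call, so there is no need to check sentence by sentence.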
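To see why it matters whether the closing period is consumed or only asserted, here is a minimal sketch that runs both variants on the question's text. The variable names consuming and lookahead are only for illustration, and the text is the sample from the question.

```python
import re

text = ("He was there. The terrorist destroyed the building at 23:45 "
        "with a remote detonation device. He escaped at 23:58 from the "
        "balcony of the terrace. He did not survived. Time of death was "
        "00:14. The police found his body 10 minutes after the explosion")

# The trailing period is consumed, so a sentence that follows a match
# has no leading period left to anchor on and is skipped.
consuming = re.findall(r"(?:\.|\A)([^.]*\d*:\d*[^.]*)\.", text)

# The trailing period is only asserted, so it can open the next match.
lookahead = re.findall(r"(?:\.|\A)([^.]*\d*:\d*[^.]*)(?=\.)", text)

print(len(consuming))  # 2: the 23:58 sentence is missing
print(len(lookahead))  # 3
for sentence in lookahead:
    print(sentence.strip())
```

Both variants leave a leading space on each captured sentence, which is why the loop strips it before printing.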
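Edit:
I have changed the regex above so that it captures every sentence containing your keyword, unless the keyword is immediately next to a .
If I may suggest another technique, using a list comprehension:
[s for s in re.split(r'\.', text) if re.search(r'\d*:\d*', s)]
For your example this returns:
[' The terrorist destroyed the building at 23:45 with a remote detonation device',
' He escaped at 23:58 from the balcony of the terrace',
' Time of death was 00:14']
Note that this will still run into problems if your text contains a . that is not a sentence-ending period. For example, "Mr. Magoo ate beans and toast at 12:34" will capture " Magoo ate beans and toast at 12:34" and will miss "Mr".
If you run into this problem, I suggest asking it as a separate question.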
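The split-and-filter idea can be wrapped in a small helper. This is only a sketch: the name sentences_with_time and the stricter time pattern (one or two digits, a colon, two digits) are assumptions made for illustration, and it does not solve the abbreviation problem described above.

```python
import re

# Assumption: a time is one or two digits, a colon, then two digits.
TIME_RX = re.compile(r"\b\d{1,2}:\d{2}\b")

def sentences_with_time(text):
    """Return the stripped sentences of `text` that contain a time."""
    # Naive split on periods; abbreviations such as "Mr." still break this.
    pieces = (piece.strip() for piece in text.split("."))
    return [piece for piece in pieces if TIME_RX.search(piece)]

sample = "He was there. He escaped at 23:58 from the terrace. Ratio was 3:1."
print(sentences_with_time(sample))
# ['He escaped at 23:58 from the terrace']
```

Compared with \d*:\d*, which matches any colon at all, the stricter pattern ignores things like the ratio 3:1 in the sample.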
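Well, you can achieve this easily with a regular expression that uses a positive lookbehind and a positive lookahead.
Below is an example that uses such a regex.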
import re

def replace_keyword(start, end, data):
    # An empty start means "from the beginning of the string";
    # an empty end means "up to the end of the string".
    start_rx = re.escape(start) if start else "^"
    end_rx = re.escape(end) if end else "$"
    rx = "(?<={0}).*(?={1})".format(start_rx, end_rx)
    match = re.search(rx, data, re.DOTALL | re.MULTILINE)
    if match:
        # The lookahead does not consume the end keyword, so add it back.
        return match.group() + end
    else:
        return data

data = "He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14. The police found his body 10 minutes after the explosion"
# An empty start string means: start searching from the beginning of the string.
start = ""
# An empty end string would mean: search until the end of the string.
end = "00:14"
data = replace_keyword(start, end, data)
print(data)
After running the code above, data will contain the text
He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14
Hope it does what you expect.
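One detail worth spelling out is that a greedy .* in front of a lookahead runs to the last occurrence of the end keyword, while a lazy .*? stops at the first. A minimal sketch, assuming a hypothetical helper text_up_to and a made-up sample string in which the keyword appears twice:

```python
import re

def text_up_to(keyword, data, greedy=True):
    """Return `data` from its start through one occurrence of `keyword`."""
    body = ".*" if greedy else ".*?"
    # (?=...) asserts that the keyword follows without consuming it.
    rx = r"\A{0}(?={1})".format(body, re.escape(keyword))
    match = re.search(rx, data, re.DOTALL)
    return match.group() + keyword if match else data

data = "He escaped at 23:58. Time of death was 00:14. Report filed at 00:14."
print(text_up_to("00:14", data))                # through the last 00:14
print(text_up_to("00:14", data, greedy=False))  # through the first 00:14
```

re.escape keeps characters in the keyword from being read as regex syntax, which matters as soon as the keyword contains something like a period or a parenthesis.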
UPDATE 2: I just found a pattern that works. Hope it helps:
(?:^|\s+)([^.!?]*(?:\d\d:\d\d)[^.!?]*[.!?])
Explanation:
(?:^|\s+)        non-capturing group: start of the string, or one or more whitespace characters
(                capturing group starts
[^.!?]*          zero or more characters other than . ! or ?
(?:\d\d:\d\d)    non-capturing group: a time in dd:dd format
[^.!?]*          zero or more characters other than . ! or ?
[.!?]            the sentence ends with . ! or ?
)                capturing group ends
import re
text = "He was there. The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. He did not survived. Time of death was 00:14. The police found his body 10 minutes after the explosion"
print(' '.join(re.findall(r'(?:^|\s+)([^.!?]*(?:\d\d:\d\d)[^.!?]*[.!?])', text)))
Output:
The terrorist destroyed the building at 23:45 with a remote detonation device. He escaped at 23:58 from the balcony of the terrace. Time of death was 00:14.
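The same pattern can carry its explanation inline by compiling it with re.VERBOSE, which ignores whitespace and # comments outside character classes. A minimal sketch; the name SENTENCE_WITH_TIME and the shortened sample text are made up for illustration:

```python
import re

SENTENCE_WITH_TIME = re.compile(r"""
    (?:^|\s+)        # start of string, or whitespace before the sentence
    (                # capturing group starts
      [^.!?]*        # any characters other than sentence punctuation
      \d\d:\d\d      # a time in dd:dd format
      [^.!?]*        # any characters other than sentence punctuation
      [.!?]          # sentence-ending punctuation
    )                # capturing group ends
""", re.VERBOSE)

text = ("He was there. He escaped at 23:58 from the terrace! "
        "Was it over? Time of death was 00:14.")
sentences = SENTENCE_WITH_TIME.findall(text)
print(sentences)
# ['He escaped at 23:58 from the terrace!', 'Time of death was 00:14.']
```

Because the pattern has exactly one capturing group, findall returns a list of strings, which is also what lets the one-liner above join them with spaces.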