Python Regex: Finding all overlapping and non-overlapping pattern matches
I would like to find all overlapping and non-overlapping pattern matches.
My code is as follows:
import re
words = [r"\bhello\b", r"\bworld\b", r"\bhello world\b"]
sentence = "Hola hello world and hello"
for word in words:
    for match in re.finditer(word, sentence):
        print(match.span(), match.group())
This gives me the following results (which I am happy with, but I need an efficient way to get them):
(5, 10) hello
(21, 26) hello
(11, 16) world
(5, 16) hello world
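I know this is not efficient.
Example: suppose I have 20k words and 10k sentences. That would be 200M x 2 calls to re.match, which would take a lot of time.
Could you suggest an efficient way to solve this?
The order is different, but the results are the same, and it is significantly faster: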
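import re
sentence = "Hola hello world and hello"  # same sentence as in the question
substrings = ['hello', 'world', 'hello world']
joined = '|'.join(substrings)
# Zero-width match at every word boundary where one of the substrings starts.
# The (?:...) group makes the trailing \b apply to every alternative.
reg = re.compile(rf"\b(?=(?:{joined})\b)")
for m in reg.finditer(sentence):
    offset = m.start()
    for e in substrings:
        # Plain prefix comparison at the candidate offset
        if sentence[offset:offset+len(e)] == e:
            print((offset, offset+len(e)), e)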
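If you want to make sure you do not match hello worldLY (that is, a case where the substring is only the prefix of a longer word), you can do this: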
substrings = ['hello', 'world', 'hello world']
joined = '|'.join(substrings)
reg = re.compile(rf"\b(?=(?:{joined})\b)")
# One anchored pattern per substring, so each match must end at a word boundary
individual = list(map(re.compile, [rf'\b({e})\b' for e in substrings]))
for m in reg.finditer(sentence):
    offset = m.start()
    for i, e in enumerate(individual):
        # The := operator requires Python 3.8 or later
        if m2 := e.match(sentence[offset:]):
            print((offset, offset + m2.end()), substrings[i])
Either version prints:
(5, 10) hello
(5, 16) hello world
(11, 16) world
(21, 26) hello
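For reuse across many sentences, the second version could be wrapped in a function along the following lines. This is only a sketch under a few assumptions: the name find_all_matches is made up for illustration, the substrings are treated as literal text (hence re.escape) and are assumed to start and end with word characters so that \b behaves as expected, and Pattern.match is called with a start position instead of slicing the sentence.

```python
import re

def find_all_matches(substrings, sentence):
    """Return (start, end, substring) for each whole-word occurrence, overlaps included."""
    # Assumption: substrings are literal text, so escape any regex metacharacters.
    escaped = [re.escape(s) for s in substrings]
    joined = "|".join(escaped)
    # Zero-width prefilter: word-boundary positions where some substring starts
    # and ends at a word boundary.
    candidates = re.compile(rf"\b(?=(?:{joined})\b)")
    # One pattern per substring; each must end at a word boundary.
    anchored = [re.compile(rf"{e}\b") for e in escaped]
    results = []
    for m in candidates.finditer(sentence):
        start = m.start()
        for text, pattern in zip(substrings, anchored):
            # match(string, pos) anchors at pos without copying the sentence;
            # the returned end index is relative to the full string.
            m2 = pattern.match(sentence, start)
            if m2:
                results.append((start, m2.end(), text))
    return results

print(find_all_matches(["hello", "world", "hello world"],
                       "Hola hello world and hello"))
```

With the sentence from the question this should give the same four spans as above, and for "hello worldly" only the span for hello, since neither world nor hello world ends at a word boundary there. The inner loop still visits every substring at each candidate position, so with 20k substrings a further step (for example, grouping substrings by first word) may be worth considering.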