Positions of substrings in string
I need to know all positions of a word in a text, i.e. of a substring in a string. My solution so far uses a regular expression, but I am not sure whether there is a better strategy using only the built-in standard library. Any ideas?
import re

text = "The quick brown fox jumps over the lazy dog. fox. Redfox."
links = {'fox': [], 'dog': []}

re_capture = r"(^|[^\w\-/])(%s)([^\w\-/]|$)" % "|".join(links.keys())
iterator = re.finditer(re_capture, text)

for match in iterator:
    # fix position by context
    # (' ', 'fox', ' ')
    m_groups = match.groups()
    start, end = match.span()
    start = start + len(m_groups[0])
    end = end - len(m_groups[2])
    key = m_groups[1]
    links[key].append((start, end))

print(links)
{'fox': [(16, 19), (45, 48)], 'dog': [(40, 43)]}
Edit: Partial words must not match - note that the fox inside Redfox does not appear in links.
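For comparison, the same boundary check can also be written with zero-width lookarounds, so that match.span() already points at the word itself and the manual offset fix goes away (just a sketch of the same idea, not necessarily better):

import re

text = "The quick brown fox jumps over the lazy dog. fox. Redfox."
links = {'fox': [], 'dog': []}

# the lookbehind/lookahead consume nothing, so no group-length bookkeeping is needed
re_capture = r"(?<![\w\-/])(%s)(?![\w\-/])" % "|".join(links.keys())
for match in re.finditer(re_capture, text):
    links[match.group(1)].append(match.span(1))

print(links)
# {'fox': [(16, 19), (45, 48)], 'dog': [(40, 43)]}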
Thanks.
Not particularly pythonic, but without a regex:
text = "The quick brown fox jumps over the lazy dog. fox."
links = {'fox': [], 'dog': []}
for key in links:
pos = 0
while(True):
pos = text.find(key, pos)
if pos < 0:
break
links[key].append((pos, pos + len(key)))
pos = pos + 1
print(links)
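On the question's full sample text (the one ending in Redfox.) this loop would also report the fox embedded in Redfox, which the question's edit rules out; a quick check using the same str.find:

text = "The quick brown fox jumps over the lazy dog. fox. Redfox."
pos = text.find("fox", 46)            # search past the standalone "fox." at 45
print((pos, pos + len("fox")))        # (53, 56) -- the "fox" inside "Redfox"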
If you want to match actual words and your string contains ascii:
text = "fox The quick brown fox jumps over the fox! lazy dog. fox!."
links = {'fox': [], 'dog': []}
from string import punctuation
def yield_words(s,d):
i = 0
for ele in s.split(" "):
tot = len(ele) + 1
ele = ele.rstrip(punctuation)
ln = len(ele)
if ele in d:
d[ele].append((i, ln + i))
i += tot
return d
Unlike the find solution, this will not match partial words, and it runs in O(n) time:
In [2]: text = "The quick brown fox jumps over the lazy dog. fox. Redfox."
In [3]: links = {'fox': [], 'dog': []}
In [4]: yield_words(text,links)
Out[4]: {'dog': [(40, 43)], 'fox': [(16, 19), (45, 48)]}
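One assumption worth flagging: yield_words splits on a single space, so tabs, newlines or runs of spaces would throw the offsets off. If that matters, a sketch of the same idea that locates each whitespace-delimited token with re.finditer (the name here is illustrative, not part of the original):

import re
from string import punctuation

def yield_words_ws(s, d):
    # Same idea as yield_words, but each token's offset comes straight from
    # the match object, so any kind and amount of whitespace is handled.
    for match in re.finditer(r"\S+", s):
        word = match.group().rstrip(punctuation)
        if word in d:
            d[word].append((match.start(), match.start() + len(word)))
    return d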
This may be one case where a regex really is a good approach, as it can be much simpler:
import re

def reg_iter(s, d):
    # one alternation of \b-delimited words, e.g. r"\bfox\b|\bdog\b"
    r = re.compile("|".join([r"\b{}\b".format(w) for w in d]))
    for match in r.finditer(s):
        d[match.group()].append((match.start(), match.end()))
    return d
Output:
In [6]: links = {'fox': [], 'dog': []}
In [7]: text = "The quick brown fox jumps over the lazy dog. fox. Redfox."
In [8]: reg_iter(text, links)
Out[8]: {'dog': [(40, 43)], 'fox': [(16, 19), (45, 48)]}
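If the keys could ever contain regex metacharacters, a slightly more defensive sketch (illustrative, not part of the original answer) escapes them before building the alternation; for plain word keys it behaves the same:

import re

def reg_iter_escaped(s, d):
    # Like reg_iter, but re.escape() keeps any metacharacters in the keys
    # (".", "+", ...) from being interpreted as regex syntax.
    r = re.compile("|".join(r"\b{}\b".format(re.escape(w)) for w in d))
    for match in r.finditer(s):
        d[match.group()].append((match.start(), match.end()))
    return d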