Replace a word by a word in a string
I have the following dictionary:
word_dict = {'a': 'a1', 'winter': 'cold', 'summer': 'hot'}
I have a string like this:
data = "It's winter not summer. Have a nice day"
What I want to do is replace a with a1, winter with cold, and so on, in data. I tried the following code:
for word in word_dict:
    data = data.replace(word, word_dict[word])
But it fails because it replaces substrings of data rather than whole words. For example, the word Have becomes Ha1ve.
The result should be:
data = "It's cold not hot. Have a1 nice day"
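You can use re.sub. \b matches a word boundary, i.e. the position between a word character and a non-word character. We need word boundaries to match the exact word; otherwise the pattern would also match the a inside day.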
>>> import re
>>> word_dict = {'a': 'a1', 'winter': 'cold', 'summer': 'hot'}
>>> data = "It's winter not summer. Have a nice day"
>>> for word in word_dict:
...     data = re.sub(r'\b' + word + r'\b', word_dict[word], data)
...
>>> data
"It's cold not hot. Have a1 nice day"
Besides regular expressions, there are other ways to do this:
ldata = data.split(' ')  # split on single space characters
res = []
for i in ldata:
    if i in word_dict:
        res.append(word_dict[i])
    else:
        res.append(i)
final = ' '.join(res)
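The regex solution is more practical and fits what you want, but str.split() and str.join() sometimes come in handy. :) Note that a token with punctuation attached, such as summer., is not in the dictionary and is left unchanged by this approach.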
You can use a generator expression inside join():
>>> word_dict = {'a': 'a1', 'winter': 'cold', 'summer': 'hot'}
>>> data = "It's winter not summer. Have a nice day"
>>> ' '.join(word_dict[j] if j in word_dict else j for j in data.split())
"It's cold not summer. Have a1 nice day"
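By splitting the data you can look up each word and use a simple comprehension to replace the matching ones. As the output shows, summer. keeps its trailing period and therefore is not replaced.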
Use dict.get, and split on " " to keep the original spacing:
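from string import punctuation
print(" ".join([word_dict.get(x.rstrip(punctuation), x) for x in data.split(" ")]))
It's cold not hot Have a1 nice day
We also need to strip trailing punctuation so that summer. matches summer, and so on. Note that the stripped punctuation is not put back when a word is replaced, so the period after hot is lost.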
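Since rstrip(punctuation) discards the trailing punctuation before the dictionary lookup, a replaced word comes back without it. A small sketch that reattaches the stripped tail might look like this. The helper name replace_token is hypothetical, and only trailing punctuation is handled, not leading.

```python
from string import punctuation

word_dict = {'a': 'a1', 'winter': 'cold', 'summer': 'hot'}
data = "It's winter not summer. Have a nice day"

def replace_token(token, mapping):
    # Separate the word from any trailing punctuation.
    core = token.rstrip(punctuation)
    tail = token[len(core):]
    # Look up the bare word, then put the punctuation back.
    return mapping.get(core, core) + tail

result = " ".join(replace_token(t, word_dict) for t in data.split(" "))
print(result)  # It's cold not hot. Have a1 nice day
```

Splitting on " " rather than on arbitrary whitespace keeps runs of spaces intact when the tokens are joined again.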
Some timings show that, even with the splitting and stripping, the non-regex approach is still more than twice as fast:
In [18]: %%timeit data = "It's winter not summer. Have a nice day"
   ....: for word in word_dict:
   ....:     data = re.sub(r'\b'+word+r'\b', word_dict[word], data)
   ....:
100000 loops, best of 3: 12.2 µs per loop
In [19]: timeit " ".join([word_dict.get(x.rstrip(punctuation), x) for x in data.split(" ")])
100000 loops, best of 3: 5.52 µs per loop
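For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard timeit module. The function names with_regex and with_split are made up for illustration, and the absolute numbers will depend on the machine and Python version, so none are shown.

```python
import re
import timeit
from string import punctuation

word_dict = {'a': 'a1', 'winter': 'cold', 'summer': 'hot'}
data = "It's winter not summer. Have a nice day"

def with_regex(text=data):
    # One re.sub call per dictionary key, as in the regex answer.
    for word in word_dict:
        text = re.sub(r'\b' + word + r'\b', word_dict[word], text)
    return text

def with_split(text=data):
    # Split on spaces, strip trailing punctuation, look up each token.
    return " ".join([word_dict.get(x.rstrip(punctuation), x) for x in text.split(" ")])

regex_time = timeit.timeit(with_regex, number=10000)
split_time = timeit.timeit(with_split, number=10000)
print(f"regex: {regex_time:.4f}s, split: {split_time:.4f}s for 10000 runs each")
```

Keep in mind that the two functions do not return identical strings: the split version drops the period after the replaced word, so the comparison is about speed only.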