How to find similar words in a set?
word = "work"
word_set = {"word","look","wrap","pork"}
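How can I find the similar words, i.e. "word" and "pork", which each need only one letter changed to become "work"?
I am wondering whether there is a way to find the difference between a string and the items in a set.
You can do it like this: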
word = "work"
word_set = {"word", "look", "wrap", "pork"}
for example in word_set:
    # Only same-length words can differ by a single substitution.
    if len(example) != len(word):
        continue
    # Count the positions where the two words differ.
    num_chars_out = sum(1 for c1, c2 in zip(example, word) if c1 != c2)
    if num_chars_out == 1:
        print(example)
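To also see where each candidate differs, which is what the question asks about, the same zip comparison can collect the mismatched positions instead of only counting them. This is a minimal sketch: the helper name diff_positions is made up for illustration, and like the loop above it only handles words of equal length.

```python
def diff_positions(a, b):
    """Return (index, char_in_a, char_in_b) for each position where a and b differ.

    Hypothetical helper for illustration; requires len(a) == len(b).
    """
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return [(i, c1, c2) for i, (c1, c2) in enumerate(zip(a, b)) if c1 != c2]


word = "work"
word_set = {"word", "look", "wrap", "pork"}

# Keep only the words that differ from `word` in exactly one position.
one_off = {}
for candidate in word_set:
    if len(candidate) != len(word):
        continue
    diffs = diff_positions(candidate, word)
    if len(diffs) == 1:
        one_off[candidate] = diffs[0]

# Sorted so the printed order does not depend on set iteration order.
for candidate, (i, c1, c2) in sorted(one_off.items()):
    print(f"{candidate}: position {i}, {c1!r} -> {c2!r}")
```

Each entry in one_off records the index and the two characters involved, so it tells you which letter to change, not just that one change is enough.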
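I would recommend the editdistance Python package, which provides an editdistance.eval function that computes the number of single-character edits needed to turn the first word into the second. Edit distance is the same thing as the Levenshtein distance that MattDMo suggested.
In your example, if you want to identify the words within an edit distance of 1 of each other, you can do this: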
import editdistance as ed
thresh = 1
w1 = "work"
word_set = set(["word","look","wrap","pork"])
neighboring_words = [w2 for w2 in word_set if ed.eval(w1, w2) <= thresh]
print(neighboring_words)
neighboring_words evaluates to ['pork', 'word'] (the order may vary, since a set is unordered).
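If installing a third-party package is not an option, the same distance can be computed with a short dynamic-programming function. The sketch below is a plain textbook Levenshtein implementation, not the code used inside editdistance; the function name levenshtein is my own.

```python
def levenshtein(a, b):
    """Textbook dynamic-programming edit distance (insert, delete, substitute)."""
    # previous[j] is the distance between the prefix of `a` processed so far
    # (initially empty) and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a matching char
            ))
        previous = current
    return previous[-1]


thresh = 1
w1 = "work"
word_set = {"word", "look", "wrap", "pork"}

# Sorted so the result does not depend on set iteration order.
neighbors = sorted(w2 for w2 in word_set if levenshtein(w1, w2) <= thresh)
print(neighbors)
```

Unlike the same-length loop, this also accepts insertions and deletions, so with thresh = 1 a word such as "works" would count as a neighbor too.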
Use difflib.get_close_matches() from the standard library:
import difflib
word = "work"
word_set = {"word","look","wrap","pork"}
difflib.get_close_matches(word, word_set)
returns:
['word', 'pork']
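Note that get_close_matches() ranks by a similarity ratio rather than by an edit distance: it keeps candidates whose SequenceMatcher ratio is at least cutoff (0.6 by default) and returns at most n of them (3 by default). The sketch below prints the ratio for each word to show why only two pass here; for other inputs the default cutoff may admit words that are more than one letter away, so treat the threshold as something to tune.

```python
import difflib

word = "work"
word_set = {"word", "look", "wrap", "pork"}

# Similarity ratio of each candidate against `word` (1.0 means identical).
ratios = {w: difflib.SequenceMatcher(None, w, word).ratio() for w in word_set}
for w, r in sorted(ratios.items()):
    print(w, r)

# Lowering `cutoff` admits weaker matches; `n` caps how many are returned.
loose = difflib.get_close_matches(word, word_set, n=4, cutoff=0.4)
print(loose)
```

With these four words, "word" and "pork" score above the default cutoff while "look" and "wrap" fall below it, which is why the default call returns only the first two.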
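EDIT: If needed, difflib.SequenceMatcher.get_opcodes() can be used to compute a rough edit distance. It counts the non-equal opcodes, and a single opcode can span several characters, so treat it as an approximation rather than a true Levenshtein distance: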
matcher = difflib.SequenceMatcher(b=word)
for test_word in word_set:
    matcher.set_seq1(test_word)
    distance = len([m for m in matcher.get_opcodes() if m[0] != 'equal'])
    print(distance, test_word)
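One caveat with counting opcodes: a single 'replace' opcode can cover several characters, so the count can be smaller than the number of letters that actually change. A possible refinement, sketched below, is to add up the lengths of the non-equal spans instead. The opcodes describe one valid way to transform one string into the other, so this sum is an upper bound on the Levenshtein distance, though not always the minimum. Both function names are my own.

```python
import difflib


def opcode_count(a, b):
    """Number of non-equal opcodes, as in the loop above."""
    ops = difflib.SequenceMatcher(None, a, b).get_opcodes()
    return sum(1 for tag, *_ in ops if tag != "equal")


def opcode_cost(a, b):
    """Characters touched by the non-equal opcodes.

    An upper bound on the Levenshtein distance, not necessarily equal to it.
    """
    ops = difflib.SequenceMatcher(None, a, b).get_opcodes()
    return sum(max(i2 - i1, j2 - j1)
               for tag, i1, i2, j1, j2 in ops if tag != "equal")


# No characters in common: one big 'replace' opcode covering four letters.
print(opcode_count("abcd", "wxyz"), opcode_cost("abcd", "wxyz"))
# One letter apart: both measures agree.
print(opcode_count("word", "work"), opcode_cost("word", "work"))
```

For the one-letter cases in the question the two measures agree, so the simple count is fine there; the difference only matters once words diverge by longer runs.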