Finding closest spellings by the same first letter via different distance measures

Question

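I am trying to write a function that finds the closest spelling of a (possibly misspelled) word, using different n-grams and distance measures, restricted to candidates with the same first letter.

Here is what I have so far: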

from nltk.corpus import words
from nltk import ngrams
from nltk.metrics.distance import edit_distance, jaccard_distance

first_letters = ['A','B','C']
spellings = words.words()

def recommendation(word):
    n = 3  # n means 'n'-grams, here I use 3 as an example
    spellings_new = [w for w in spellings if (w[0] in first_letters)]
    # ________ is the distance measure
    dists = [________(set(ngrams(word, n)), set(ngrams(w, n))) for w in spellings_new]
    return spellings_new[dists.index(min(dists))]

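The rest seems straightforward, but I don't know how to specify the 'same initial letter' condition. Specifically, if the misspelled word starts with the letter 'A', then the correction recommended from the words corpus, i.e. the one with the smallest distance to the misspelled word, should also start with 'A', and so on for the other letters. As you can see in the function above, I use "(w[0] in first_letters)" as my 'initial letter condition', but this does not do the job: the function keeps returning words that start with a different letter. I haven't found a similar thread on this board that addresses my problem, and I would appreciate it if anyone could show me how to specify the 'initial letter condition'. If this has been asked before and is considered inappropriate, I will delete it.

Thanks.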

Answer 1

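You're really close. w[0] == word[0] checks whether the first letters are the same. After that, set(w) and set(word) turn each word into a set of letters. I then pass those sets to jaccard_distance, simply because that is what you had already imported. There may be better solutions.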

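from nltk.corpus import words
from nltk.metrics.distance import jaccard_distance

spellings = words.words()

def recommendation(word):
    # Keep only candidates that share the misspelled word's first letter
    spellings_new = [w for w in spellings if (w[0] == word[0])]
    # Compare sets of single letters (not n-grams)
    dists = [jaccard_distance(set(w), set(word)) for w in spellings_new]
    return spellings_new[dists.index(min(dists))]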
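
The answer's code compares sets of single letters, whereas the question asked about n-grams. Below is a minimal sketch of how the same first-letter filter could be combined with character n-grams. To keep it runnable without downloading a corpus, it uses two small pure-Python helpers, char_ngrams and jaccard, which are hypothetical stand-ins for nltk's ngrams and jaccard_distance, and it takes the candidate list as a parameter instead of reading words.words(). The function name recommend and the sample words are illustrative assumptions as well.

```python
from typing import Iterable, Optional, Set, Tuple


def char_ngrams(word: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of character n-grams of `word`; empty when len(word) < n."""
    return {tuple(word[i:i + n]) for i in range(len(word) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard distance 1 - |a & b| / |a | b|.

    Assumption for this sketch: two empty sets are given distance 0.0
    so that very short words do not cause a division by zero.
    """
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


def recommend(word: str, candidates: Iterable[str], n: int = 3) -> Optional[str]:
    """Return the candidate closest to `word` among those sharing its first letter."""
    # The initial-letter condition: compare against the input word itself,
    # not against a fixed list of letters.
    same_initial = [w for w in candidates if w and w[0] == word[0]]
    if not same_initial:
        return None
    target = char_ngrams(word, n)
    # min() with a key avoids building a separate list of distances.
    return min(same_initial, key=lambda w: jaccard(target, char_ngrams(w, n)))


if __name__ == "__main__":
    sample = ["corpulent", "cormorant", "indecence", "incident", "validate"]
    print(recommend("cormulent", sample))    # corpulent
    print(recommend("incendenece", sample))  # indecence
```

With nltk installed, passing words.words() as candidates should behave the same way, and the key function could be swapped for edit_distance(word, w) to try a different measure.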

Tags: python, distance, corpus, n-gram