How similar are two words?

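I want to measure the similarity between two words. The similarity would be a function written in C++ that returns a float between 0 and 1. If the two words are very similar, the float should be close to 1; if they are very different, it should be close to 0. For example, "Analyse" and "Analise" might return 0.95, while "Substracting" and "describe" might return something close to 0. How can I do this in C++?

My attempt: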

float similarity(const std::string& word1, const std::string& word2) const{
    const std::size_t len1 = word1.size();
    const std::size_t len2 = word2.size();
    float score = 0;
    for(size_t i = 0; i<std::min(len1,len2);i++){
        score += (float)(word1[i]==word2[i])/len1;
    }
    return score;
}

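Is this OK? Can I do better? I don't need machine learning here. This is only for testing purposes, but it can't be too bad either. The attempt above works, but it is not good enough.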

I think the best and only way is machine learning. If you want to do this in C++, it will be very difficult. I would recommend Python and TensorFlow, for example.

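Take a look at Levenshtein Distance and Levenshtein Distance Implementation.

You can use the result of that algorithm to get what you need: normalize the edit distance by the length of the longer word and subtract it from 1.

Later edit: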

#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

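// Levenshtein distance: the minimum number of single-character insertions,
// deletions and substitutions needed to turn s1 into s2.
unsigned int edit_distance(const std::string& s1, const std::string& s2) {
    const std::size_t len1 = s1.size(), len2 = s2.size();
    // d[i][j] = distance between the first i chars of s1 and the first j chars of s2
    std::vector<std::vector<unsigned int>> d(len1 + 1, std::vector<unsigned int>(len2 + 1));

    d[0][0] = 0;
    for (unsigned int i = 1; i <= len1; ++i) d[i][0] = i;  // delete i chars
    for (unsigned int j = 1; j <= len2; ++j) d[0][j] = j;  // insert j chars

    for (unsigned int i = 1; i <= len1; ++i)
        for (unsigned int j = 1; j <= len2; ++j)
            // cheapest of: deletion, insertion, match/substitution
            d[i][j] = std::min(std::min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                               d[i - 1][j - 1] + (s1[i - 1] == s2[j - 1] ? 0 : 1));
    return d[len1][len2];
}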

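float similarity(const std::string& s1, const std::string& s2) {
    const std::size_t longest = std::max(s1.size(), s2.size());
    if (longest == 0) return 1.0f;  // two empty strings are identical; avoids 0/0
    return static_cast<float>(1.0 - static_cast<double>(edit_distance(s1, s2)) / longest);
}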

int main() {
    std::vector<std::pair<std::string, std::string>> words = {
        { "Julius", "Iulius" },
        { "Frank", "Blank" },
        { "George", "Dog" },
        { "Cat", "Elephant" },
        { "Cucumber", "Tomato" }
    };
    for (const auto& word_pair : words) {
        std::cout << "Similarity between [" << word_pair.first << "] & ["
        << word_pair.second << "]: " << similarity(word_pair.first, word_pair.second)
        << std::endl;
    }
    return 0;
}

And the output:

Similarity between [Julius] & [Iulius]: 0.833333
Similarity between [Frank] & [Blank]: 0.6
Similarity between [George] & [Dog]: 0.333333
Similarity between [Cat] & [Elephant]: 0.25
Similarity between [Cucumber] & [Tomato]: 0.125
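
If memory matters for long strings, the same recurrence can be computed with only two rows of the table instead of the full matrix. The following is a minimal sketch of that variant, not a drop-in replacement for the code above; the names edit_distance_two_rows and normalized_similarity are hypothetical and chosen just for this illustration. It assumes plain byte-wise, case-sensitive comparison and treats two empty strings as identical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Levenshtein distance that keeps only the previous and the current row
// of the dynamic-programming table (O(|b|) memory instead of O(|a|*|b|)).
std::size_t edit_distance_two_rows(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;  // "" -> first j chars of b

    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;  // first i chars of a -> ""
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,          // deletion
                                curr[j - 1] + 1,      // insertion
                                prev[j - 1] + cost}); // match or substitution
        }
        std::swap(prev, curr);  // the row just filled becomes the previous row
    }
    return prev[b.size()];
}

// Hypothetical helper: similarity in [0, 1], normalized by the longer word.
// Assumption: two empty strings count as identical (similarity 1).
double normalized_similarity(const std::string& a, const std::string& b) {
    const std::size_t longest = std::max(a.size(), b.size());
    if (longest == 0) return 1.0;
    return 1.0 - static_cast<double>(edit_distance_two_rows(a, b)) / longest;
}
```

For the example from the question, "Analyse" and "Analise" differ by one substitution, so normalized_similarity gives 1 - 1/7, roughly 0.857. That is lower than the 0.95 the question hoped for, so the normalization may need adjusting if scores that high are required for single-letter differences.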