How similar are two words?
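I want to measure the similarity between two words. The similarity would be a function written in C++ that returns a float between 0 and 1. If the two words are very similar, the float should be close to 1; if they are very different, it should be close to 0. For example, "Analyse" and "Analise" might return 0.95, while "Substracting" and "describe" might return something close to 0. How can I do this in C++?
My attempt: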
float similarity(const std::string& word1, const std::string& word2) const{
    const std::size_t len1 = word1.size();
    const std::size_t len2 = word2.size();
    float score = 0;
    for(size_t i = 0; i<std::min(len1,len2);i++){
        score += (float)(word1[i]==word2[i])/len1;
    }
    return score;
}
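Is this OK? Can I do better? I don't need machine learning here. This is only for testing purposes, but it can't be too bad either. The attempt above works, but it isn't good enough.
I think the best and only way is machine learning. Doing this in C++ would be very hard; I would recommend Python and TensorFlow, for example.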
Take a look at Levenshtein Distance and Levenshtein Distance Implementation.
You can use the result of that algorithm to get what you need.
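To make that step concrete: one common normalisation (a sketch, not the only reasonable choice) divides the edit distance $d(s_1, s_2)$ by the length of the longer word, assuming at least one of the two words is non-empty:

```latex
\[
\mathrm{sim}(s_1, s_2) \;=\; 1 - \frac{d(s_1, s_2)}{\max(|s_1|, |s_2|)},
\qquad 0 \le \mathrm{sim}(s_1, s_2) \le 1 .
\]
```

Because the Levenshtein distance never exceeds the length of the longer word, the score stays in the range 0 to 1, and identical words score exactly 1.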
Later edit:
#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Levenshtein distance: the minimum number of single-character insertions,
// deletions and substitutions needed to turn s1 into s2.
unsigned int edit_distance(const std::string& s1, const std::string& s2) {
    const std::size_t len1 = s1.size(), len2 = s2.size();
    std::vector<std::vector<unsigned int>> d(len1 + 1, std::vector<unsigned int>(len2 + 1));

    // Turning a prefix into the empty string (or back) costs its length.
    d[0][0] = 0;
    for (unsigned int i = 1; i <= len1; ++i) d[i][0] = i;
    for (unsigned int i = 1; i <= len2; ++i) d[0][i] = i;

    for (unsigned int i = 1; i <= len1; ++i)
        for (unsigned int j = 1; j <= len2; ++j)
            // Cheapest of: deletion, insertion, substitution (free on a match).
            d[i][j] = std::min(std::min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                               d[i - 1][j - 1] + (s1[i - 1] == s2[j - 1] ? 0 : 1));
    return d[len1][len2];
}

// Maps the distance to a score in [0, 1]; 1 means the strings are identical.
float similarity(const std::string& s1, const std::string& s2) {
    const std::size_t longest = std::max(s1.size(), s2.size());
    if (longest == 0) return 1.0f;  // two empty strings: avoid dividing 0 by 0
    return 1.0f - static_cast<float>(edit_distance(s1, s2)) / longest;
}

int main() {
    std::vector<std::pair<std::string, std::string>> words = {
        { "Julius", "Iulius" },
        { "Frank", "Blank" },
        { "George", "Dog" },
        { "Cat", "Elephant" },
        { "Cucumber", "Tomato" }
    };
    for (const auto& word_pair : words) {
        std::cout << "Similarity between [" << word_pair.first << "] & ["
                  << word_pair.second << "]: " << similarity(word_pair.first, word_pair.second)
                  << std::endl;
    }
    return 0;
}
And the output:
Similarity between [Julius] & [Iulius]: 0.833333
Similarity between [Frank] & [Blank]: 0.6
Similarity between [George] & [Dog]: 0.333333
Similarity between [Cat] & [Elephant]: 0.25
Similarity between [Cucumber] & [Tomato]: 0.125