How to find the similarity between two questions even though the words are different

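Is there any way to find out whether two strings are similar in meaning, even though the words in the strings are different?

So far I have tried fuzzywuzzy, Levenshtein distance, and cosine similarity to match the strings, but all of them match on the words themselves rather than on the meaning of the words.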

from fuzzywuzzy import fuzz

# levenshtein_ratio_and_distance is a helper function defined elsewhere (not shown here)

Str1 = "what are types of negotiation"
Str2 = "what are advantages of negotiation"
Str3 = "what are categories of negotiation"

# Character-based scores for Str1 vs Str2
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)

# Character-based scores for Str1 vs Str3
Ratio1 = fuzz.ratio(Str1.lower(),Str3.lower())
Partial_Ratio1 = fuzz.partial_ratio(Str1.lower(),Str3.lower())
Token_Sort_Ratio1 = fuzz.token_sort_ratio(Str1,Str3)

print("fuzzywuzzy")
print(Str1," ",Str2," ",Ratio)
print(Str1," ",Str2," ",Partial_Ratio)
print(Str1," ",Str2," ",Token_Sort_Ratio)
print(Str1," ",Str3," ",Ratio1)
print(Str1," ",Str3," ",Partial_Ratio1)
print(Str1," ",Str3," ",Token_Sort_Ratio1)

print("levenshtein ratio")
Ratio = levenshtein_ratio_and_distance(Str1,Str2,ratio_calc = True)
Ratio1 = levenshtein_ratio_and_distance(Str1,Str3,ratio_calc = True)
print(Str1," ",Str2," ",Ratio)
print(Str1," ",Str3," ",Ratio1)

output:
fuzzywuzzy
what are types of negotiation   what are advantages of negotiation   86
what are types of negotiation   what are advantages of negotiation   76
what are types of negotiation   what are advantages of negotiation   73
what are types of negotiation   what are categories of negotiation   86
what are types of negotiation   what are categories of negotiation   76
what are types of negotiation   what are categories of negotiation   73
levenshtein ratio
what are types of negotiation   what are advantages of negotiation   0.8571428571428571
what are types of negotiation   what are categories of negotiation   0.8571428571428571



expected output:
"what are the types of negotiation skill?"
"what are the categories in negotiation skill?"
output:similar
"what are the types of negotiation skill?"
"what are the advantages of negotiation skill?"
output:not similar

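You want to score the semantic similarity of two strings.

FuzzyWuzzy and Levenshtein distance score only the character-level distance.

You need to take the semantic information into account. For that, you need a semantic representation of your strings.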

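A simple but effective approach could be:

  1. Compute two vectors representing your two strings, using pretrained word embeddings for your language (e.g. FastText - get_sentence_vector https://fasttext.cc/docs/en/python-module.html#model-object)
  2. Compute the cosine similarity between the two vectors (1: identical strings; close to 0: very different strings).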
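
The two steps can be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: TOY_EMBEDDINGS is a hypothetical, hand-made table of 4-dimensional word vectors standing in for real pretrained embeddings, and sentence_vector is a simplified stand-in for fastText's get_sentence_vector (it just averages the unit-length word vectors). With fastText installed and a pretrained model downloaded, you would load the model and call get_sentence_vector instead, as noted in the comments.

```python
import math

# Hypothetical toy embeddings, hand-made for illustration only.
# With fastText you would instead do something like:
#   import fasttext
#   model = fasttext.load_model("cc.en.300.bin")  # assumed local model file
#   vec = model.get_sentence_vector(text)
DIM = 4
TOY_EMBEDDINGS = {
    "what":        [1.0, 0.0, 0.0, 0.0],
    "are":         [1.0, 0.0, 0.0, 0.0],
    "of":          [1.0, 0.0, 0.0, 0.0],
    "negotiation": [0.0, 1.0, 0.0, 0.0],
    "types":       [0.0, 0.0, 1.0, 0.0],
    "categories":  [0.0, 0.0, 0.9, 0.1],  # close to "types"
    "advantages":  [0.0, 0.0, 0.0, 1.0],  # unrelated to "types"
}


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def sentence_vector(text, embeddings=TOY_EMBEDDINGS):
    """Step 1: average the unit-length vectors of the known words in text."""
    vectors = [normalize(embeddings[w]) for w in text.lower().split()
               if w in embeddings]
    if not vectors:
        return [0.0] * DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Step 2: cosine of the angle between u and v (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


s1 = "what are types of negotiation"
s2 = "what are advantages of negotiation"
s3 = "what are categories of negotiation"

sim_12 = cosine_similarity(sentence_vector(s1), sentence_vector(s2))
sim_13 = cosine_similarity(sentence_vector(s1), sentence_vector(s3))
print("types vs advantages:", round(sim_12, 3))
print("types vs categories:", round(sim_13, 3))
```

Unlike the character-based ratios, the two pairs no longer get the same score here: the "categories" pair scores higher than the "advantages" pair. To get a similar / not similar decision you still have to pick a threshold yourself, and a sensible value depends on the embeddings you use.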

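Of course, there are better and more complex approaches. To dig deeper into this topic, I suggest this post (https://medium.com/@adriensieg/text-similarities-da019229c894), which has rich explanations and code implementations.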