Problem with Py_stringmatching GeneralizedJaccard

Question

I am using GeneralizedJaccard from the Py_stringmatching package to measure the similarity between two strings. According to this document:

... If the similarity of a token pair exceeds the threshold, then the token pair is considered a match ...

For example, for the word pair 'method' and 'methods', we have:

import py_stringmatching as sm

print(sm.Levenshtein().get_sim_score('method', 'methods'))
>> 0.8571428571428572

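The similarity of this word pair is about 0.85, which is greater than 0.80, so the pair should be considered a match. I therefore expected the final GeneralizedJaccard output for the two near-duplicate sentences to be 1, but it is about 0.97: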

import py_stringmatching as sm

str1 = 'All tokenizers have a tokenize method'
str2 = 'All tokenizers have a tokenize methods'

# Tokenize on alphabetic runs and return each sentence as a set of tokens
alphabet_tok_set = sm.AlphabeticTokenizer(return_set=True)

# Token pairs whose Levenshtein similarity exceeds 0.8 count as matches
gj = sm.GeneralizedJaccard(sim_func=sm.Levenshtein().get_sim_score, threshold=0.8)

print(gj.get_raw_score(alphabet_tok_set.tokenize(str1), alphabet_tok_set.tokenize(str2)))

>> 0.9761904761904763

So what is the problem?

Answer 1

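The answer is that once a pair is considered a match, the Jaccard formula uses the pair's actual similarity score rather than 1. The threshold only decides which pairs count as matches; it does not round their scores up. Here the five identical tokens each contribute 1.0 and the pair 'method'/'methods' contributes about 0.857, so the numerator is about 5.857. The denominator is 6 + 6 - 6 = 6 (the two set sizes minus the number of matched pairs), which gives 5.857 / 6 ≈ 0.976.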
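To make that arithmetic concrete, here is a minimal pure-Python sketch of a generalized Jaccard score. It is an illustration only, not the library's source code: the helper names (levenshtein_distance, levenshtein_sim, generalized_jaccard_sketch, binary_sim) are hypothetical, the greedy best-pair-first matching is an assumption about how pairs are selected, and str.split() stands in for AlphabeticTokenizer, which gives the same tokens for these two sentences.

```python
def levenshtein_distance(a, b):
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_sim(a, b):
    """Normalized similarity: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def generalized_jaccard_sketch(tokens1, tokens2, sim_func, threshold):
    """Illustrative generalized Jaccard with greedy one-to-one matching."""
    set1, set2 = set(tokens1), set(tokens2)
    if not set1 and not set2:
        return 1.0
    # Score every cross pair and keep only those above the threshold.
    pairs = [(sim_func(x, y), x, y) for x in set1 for y in set2]
    pairs = sorted((p for p in pairs if p[0] > threshold), reverse=True)
    used1, used2, total = set(), set(), 0.0
    for sim, x, y in pairs:
        # Each token may take part in at most one matched pair.
        if x not in used1 and y not in used2:
            used1.add(x)
            used2.add(y)
            total += sim  # the raw similarity is added, not 1
    return total / (len(set1) + len(set2) - len(used1))


def binary_sim(a, b):
    """Hypothetical wrapper: any pair above 0.8 counts as a full match."""
    return 1.0 if levenshtein_sim(a, b) > 0.8 else 0.0


tokens1 = 'All tokenizers have a tokenize method'.split()
tokens2 = 'All tokenizers have a tokenize methods'.split()

score = generalized_jaccard_sketch(tokens1, tokens2, levenshtein_sim, 0.8)
binary_score = generalized_jaccard_sketch(tokens1, tokens2, binary_sim, 0.8)

print(score)         # about 0.976: (5 * 1.0 + 6/7) / 6
print(binary_score)  # 1.0, because every matched pair now contributes 1
```

If a score of exactly 1 is what you want for near-duplicates like these, passing a binarized similarity function such as binary_sim is one possible approach. The trade-off is that the score can then no longer distinguish exact matches from near matches.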

Tags: python, string, text, compare, similarity