Python NLTK WUP Similarity Score not unity for exact same word

Question

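Simple code like the following gives a similarity score of 0.75 in both cases. As you can see, both words are exactly the same. To avoid any confusion, I also compared a word with itself. The score refuses to budge from 0.75. What is going on here?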

from nltk.corpus import wordnet as wn
actual = wn.synsets('orange')[0]
predicted = wn.synsets('orange')[0]
similarity = actual.wup_similarity(predicted)
print(similarity)
similarity = actual.wup_similarity(actual)
print(similarity)

Answer 1

This is an interesting problem.

TL;DR:

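Sorry, there is no short answer to this problem =(

Too long, want to read:

Looking at the code for wup_similarity(), the problem comes not from the similarity calculation but from the way NLTK traverses the WordNet hierarchy to get the lowest_common_hypernyms() (see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L805).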

Normally, the lowest common hypernym between a synset and itself should be the synset itself:

>>> from nltk.corpus import wordnet as wn
>>> y = wn.synsets('car')[0]
>>> y.lowest_common_hypernyms(y, use_min_depth=True)
[Synset('car.n.01')]

But in the case of orange, it also gives fruit:

>>> from nltk.corpus import wordnet as wn
>>> x = wn.synsets('orange')[0]
>>> x.lowest_common_hypernyms(x, use_min_depth=True)
[Synset('fruit.n.01'), Synset('orange.n.01')]

We have to look at the code for lowest_common_hypernyms(), starting with its docstring at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L805:

Get a list of lowest synset(s) that both synsets have as a hypernym. When use_min_depth == False this means that the synset which appears as a hypernym of both self and other with the lowest maximum depth is returned or if there are multiple such synsets at the same depth they are all returned. However, if use_min_depth == True then the synset(s) which has/have the lowest minimum depth and appear(s) in both paths is/are returned.

So let's try lowest_common_hypernyms() with use_min_depth=False:

>>> x.lowest_common_hypernyms(x, use_min_depth=False)
[Synset('orange.n.01')]

That seems to resolve the ambiguity of the tied paths. But the wup_similarity() API doesn't have a use_min_depth parameter:

>>> x.wup_similarity(x, use_min_depth=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: wup_similarity() got an unexpected keyword argument 'use_min_depth'

Note the difference: when use_min_depth==False, lowest_common_hypernyms() checks the maximum depth while traversing the synsets. But when use_min_depth==True, it checks the minimum depth, see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L602

So if we trace the lowest_common_hypernyms() code:

>>> synsets_to_search = x.common_hypernyms(x)
>>> synsets_to_search
[Synset('citrus.n.01'), Synset('natural_object.n.01'), Synset('orange.n.01'), Synset('object.n.01'), Synset('plant_organ.n.01'), Synset('edible_fruit.n.01'), Synset('produce.n.01'), Synset('food.n.02'), Synset('physical_entity.n.01'), Synset('entity.n.01'), Synset('reproductive_structure.n.01'), Synset('solid.n.01'), Synset('matter.n.03'), Synset('plant_part.n.01'), Synset('fruit.n.01'), Synset('whole.n.02')]

# if use_min_depth==True
>>> max_depth = max(s.min_depth() for s in synsets_to_search)
>>> max_depth
8
>>> unsorted_lowest_common_hypernym = [s for s in synsets_to_search if s.min_depth() == max_depth]
>>> unsorted_lowest_common_hypernym
[Synset('orange.n.01'), Synset('fruit.n.01')]
>>> 
# if use_min_depth==False
>>> max_depth = max(s.max_depth() for s in synsets_to_search)
>>> max_depth
11
>>> unsorted_lowest_common_hypernym = [s for s in synsets_to_search if s.max_depth() == max_depth]
>>> unsorted_lowest_common_hypernym
[Synset('orange.n.01')]
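
The same tie can be reproduced without WordNet. Below is a minimal sketch on a toy hierarchy; the graph, its depths, and the helper names (HYPERNYMS, min_depth, max_depth, lowest_common) are all made up for illustration and are not NLTK's code or WordNet's real structure. The only property it borrows from the trace above is that orange is reachable through both a short path and a long path that passes through fruit.

```python
# Toy hypernym graph (child -> list of parents). Hypothetical and heavily
# simplified: these are NOT the real WordNet depths.
HYPERNYMS = {
    "entity": [],
    "natural_object": ["entity"],
    "plant_part": ["natural_object"],
    "plant_organ": ["plant_part"],
    "fruit": ["plant_organ"],
    "food": ["entity"],
    "edible_fruit": ["fruit", "food"],  # two parents: long path and short path
    "citrus": ["edible_fruit"],
    "orange": ["citrus"],
}


def min_depth(node):
    """Length of the shortest path from node up to the root."""
    parents = HYPERNYMS[node]
    return 0 if not parents else 1 + min(min_depth(p) for p in parents)


def max_depth(node):
    """Length of the longest path from node up to the root."""
    parents = HYPERNYMS[node]
    return 0 if not parents else 1 + max(max_depth(p) for p in parents)


def ancestors_and_self(node):
    """The node together with every node reachable through its parents."""
    found = {node}
    for parent in HYPERNYMS[node]:
        found |= ancestors_and_self(parent)
    return found


def lowest_common(a, b, depth_fn):
    """Common ancestors that are deepest according to depth_fn, sorted by name."""
    common = ancestors_and_self(a) & ancestors_and_self(b)
    deepest = max(depth_fn(n) for n in common)
    return sorted(n for n in common if depth_fn(n) == deepest)


# Minimum depth: fruit (4) ties with orange (4, via the short path through food).
print(lowest_common("orange", "orange", min_depth))
# Maximum depth: orange (7, via fruit) is strictly deeper than all its ancestors.
print(lowest_common("orange", "orange", max_depth))
```

Under the minimum-depth criterion a synset can tie with one of its own ancestors whenever it has a shortcut to the root that the ancestor lacks; under the maximum-depth criterion that cannot happen, because a node's longest path is always longer than that of any of its ancestors.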

This oddity of wup_similarity() is actually highlighted in the code comments, https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L843

# Note that to preserve behavior from NLTK2 we set use_min_depth=True
# It is possible that more accurate results could be obtained by
# removing this setting and it should be tested later on
subsumers = self.lowest_common_hypernyms(other, simulate_root=simulate_root and need_root, use_min_depth=True)

And the first subsumer in the list is the one that gets selected, at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L843:

subsumer = subsumers[0]

Naturally, in the case of the orange synset, fruit is selected because it comes first in the list of synsets tied for lowest common hypernym.
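
To see why picking an ancestor as the subsumer drags the score below 1, here is a small sketch of the Wu-Palmer arithmetic. The function name wu_palmer and its parameters are hypothetical, and the numbers are illustrative assumptions chosen to be consistent with the 0.75 reported in the question (a subsumer 9 nodes deep, 3 edges above the synset); they are not values read from NLTK.

```python
def wu_palmer(subsumer_depth, dist1, dist2):
    """Wu-Palmer score.

    subsumer_depth: depth of the common subsumer, counted in nodes.
    dist1, dist2:   path lengths from each synset up to the subsumer.
    """
    return (2.0 * subsumer_depth) / (dist1 + dist2 + 2.0 * subsumer_depth)


# Assumed values: the subsumer is an ancestor 3 edges above the synset.
via_ancestor = wu_palmer(subsumer_depth=9, dist1=3, dist2=3)

# If the subsumer is the synset itself, both distances are 0 and the
# score is 1.0 whatever the depth is.
via_self = wu_palmer(subsumer_depth=12, dist1=0, dist2=0)

print(via_ancestor)  # 0.75
print(via_self)      # 1.0
```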

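To conclude, the default parameter is more of a feature than a bug: it is there to keep results reproducible with NLTK v2.x.

So the solution might be to manually change the NLTK source to force use_min_depth=False: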

https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L845
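
If editing the installed source is not an option, a possible alternative is to redo the calculation outside the library with use_min_depth=False. The sketch below is my own reconstruction, not NLTK's implementation: the function name wup_similarity_max_depth is hypothetical, it relies only on the Synset methods lowest_common_hypernyms(), max_depth() and shortest_path_distance(), and it assumes both synsets are nouns, so no simulated root is needed.

```python
def wup_similarity_max_depth(synset1, synset2):
    """Wu-Palmer score using the maximum-depth lowest common hypernym.

    Assumption: both arguments behave like NLTK noun synsets. Returns None
    when no common hypernym or no connecting path exists.
    """
    subsumers = synset1.lowest_common_hypernyms(synset2, use_min_depth=False)
    if not subsumers:
        return None
    subsumer = subsumers[0]

    # Depth counted in nodes, hence the +1 on the edge count.
    depth = subsumer.max_depth() + 1

    len1 = synset1.shortest_path_distance(subsumer)
    len2 = synset2.shortest_path_distance(subsumer)
    if len1 is None or len2 is None:
        return None

    return (2.0 * depth) / (len1 + len2 + 2.0 * depth)
```

With the WordNet corpus installed, it would be called as wup_similarity_max_depth(x, x) with x = wn.synsets('orange')[0]. For other synset pairs the scores may differ from those of the built-in wup_similarity(), which is exactly the reproducibility trade-off the code comment mentions.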

EDITED

To work around the problem, you could do an ad-hoc check for identical synsets:

def wup_similarity_hacked(synset1, synset2):
  if synset1 == synset2:
    return 1.0
  else:
    return synset1.wup_similarity(synset2)
