Is there any order in WordNet's synsets?

I am using WordNet to retrieve synonyms that share a common meaning. Here is an example:

from itertools import chain
from nltk.corpus import wordnet as wn

synsets = wn.synsets("drink")
# synsets = [Synset('drink.n.01'), Synset('drink.n.02'), Synset('beverage.n.01'), ...]
synonyms = set(chain(*(x.lemma_names() for x in synsets)))
# synonyms = {'drinking', 'drinkable', 'crapulence', 'toast', 'drink', 'drunkenness', ...}

Are synsets sorted? If so, by what criteria? Are the first synsets in the list more likely to be relevant to the given word?

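I would like to limit the number of synonyms by keeping only the "most important" ones (what "important" means in this context remains to be defined, but I would like to know whether WordNet has its own notion of "important").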

If synsets are not sorted, what are the alternatives for finding the most appropriate synonyms of a word?

The documentation has a relevant section: https://www.nltk.org/howto/wordnet.html#similarity

Several similarity measures are provided: path_similarity, lch_similarity, wup_similarity, res_similarity

For example, from the documentation (for path_similarity):

synset1.path_similarity(synset2): Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hypnoym) taxonomy. The score is in the range 0 to 1.
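To make the idea of a path-based score concrete, here is a minimal sketch on a toy taxonomy. The node names and the HYPERNYMS table are hypothetical and are not WordNet data, and the search treats is-a links as an undirected graph, which is a simplification. The score is taken as 1 / (shortest path length + 1), which is consistent with the values shown in this answer (for example 1/15 = 0.0666...). Unconnected nodes yield None.

```python
from collections import deque

# Hypothetical toy is-a taxonomy: each node maps to its direct hypernyms.
HYPERNYMS = {
    "espresso": ["coffee"],
    "coffee": ["beverage"],
    "tea": ["beverage"],
    "beverage": ["liquid"],
    "liquid": [],
    "run": ["move"],  # a separate hierarchy, not connected to the nouns above
    "move": [],
}


def shortest_path_length(a, b, hypernyms=HYPERNYMS):
    """Return the number of is-a edges between a and b, or None if unconnected."""
    # Build an undirected adjacency map from the hypernym links.
    neighbours = {node: set(parents) for node, parents in hypernyms.items()}
    for node, parents in hypernyms.items():
        for parent in parents:
            neighbours[parent].add(node)

    # Breadth-first search finds the shortest path first.
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def path_similarity(a, b):
    """Score in (0, 1]: 1.0 for identical nodes, smaller for longer paths."""
    dist = shortest_path_length(a, b)
    return None if dist is None else 1.0 / (dist + 1)


print(path_similarity("coffee", "tea"))  # two edges apart: coffee -> beverage -> tea
print(path_similarity("coffee", "run"))  # no connecting path
```

The None result for unconnected nodes mirrors what happens when a noun synset is compared with a verb synset in the output further down.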

You can use these methods as follows:

from nltk.corpus import wordnet as wn

# Compare every synset of "drink" against its first (0th) synset
all_synsets = wn.synsets("drink")
syn_to_compare = all_synsets[0]
corr = [(synset, syn_to_compare.path_similarity(synset)) for synset in all_synsets]

This produces output like the following:

[(Synset('drink.n.01'), 1.0), (Synset('drink.n.02'), 0.06666666666666667), (Synset('beverage.n.01'), 0.08333333333333333), (Synset('drink.n.04'), 0.09090909090909091), (Synset('swallow.n.02'), 0.07692307692307693), (Synset('drink.v.01'), None), (Synset('drink.v.02'), None), (Synset('toast.v.02'), None), (Synset('drink_in.v.01'), None), (Synset('drink.v.05'), None)]

You can then sort them with sorted(), using the similarity score as the key.

sorted(corr, key=lambda x: x[1] if x[1] is not None else 0, reverse=True)
[(Synset('drink.n.01'), 1.0), (Synset('drink.n.04'), 0.09090909090909091), (Synset('beverage.n.01'), 0.08333333333333333), (Synset('swallow.n.02'), 0.07692307692307693), (Synset('drink.n.02'), 0.06666666666666667), (Synset('drink.v.01'), None), (Synset('drink.v.02'), None), (Synset('toast.v.02'), None), (Synset('drink_in.v.01'), None), (Synset('drink.v.05'), None)]
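If the None entries (pairs for which no score could be computed) are not wanted in the ranking, they can be filtered out before sorting. The following is a minimal sketch, not part of NLTK: rank_by_score and drop_unscored are hypothetical helper names, and plain strings stand in for Synset objects so the snippet runs without the WordNet corpus. The scores are copied from the output above.

```python
def drop_unscored(scored):
    """Keep only the (item, score) pairs that actually have a score."""
    return [(item, score) for item, score in scored if score is not None]


def rank_by_score(scored, missing=0.0):
    """Sort (item, score) pairs from highest to lowest; None counts as `missing`."""
    return sorted(
        scored,
        key=lambda pair: pair[1] if pair[1] is not None else missing,
        reverse=True,
    )


# Strings stand in for Synset objects; scores are taken from the output above.
corr = [
    ("drink.n.01", 1.0),
    ("drink.n.02", 0.06666666666666667),
    ("beverage.n.01", 0.08333333333333333),
    ("drink.v.01", None),
]

ranked = rank_by_score(drop_unscored(corr))
print(ranked)
```

With real data, the same helpers can be applied directly to the corr list of (Synset, score) tuples.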

If you want to handle proper nouns, I suggest looking into gensim's most_similar() method.

Are synsets sorted? And, in case they are, what are the criteria? Are the first synsets of the list those which have higher chances to be correlated to the given word?

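I can't answer this definitively, but I don't think there is a criterion. You can use the approach above to find the most similar words with respect to a particular synset.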

Edit: As mentioned in the comments below, the asker is looking for the ordering of the list returned by WordNet's synsets() method.

From the code available on GitHub: https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1563 (the synsets() method)

if lang == "eng":
    get_synset = self.synset_from_pos_and_offset
    index = self._lemma_pos_offset_map
    if pos is None:
        pos = POS_LIST
    return [
        get_synset(p, offset)
        for p in pos
        for form in self._morphy(lemma, p, check_exceptions)
        for offset in index[form].get(p, [])
    ]

Here POS_LIST is defined as POS_LIST = [NOUN, VERB, ADJ, ADV], so parts of speech are processed in that order. Also, according to their code: NOUN="n", VERB="v", ADJ="a", ADV="r"

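So the order depends first on the part-of-speech order in NLTK's POS_LIST, then on the forms that _morphy() returns for the lemma and POS tag, and finally on the offsets stored in _lemma_pos_offset_map for each form.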

For example:

>>> POS_LIST = ["n", "v", "a", "r"]
>>> syn = list()
>>> lemma = "drink"
>>> for p in POS_LIST:
...     for form in wn._morphy(lemma, p, True):
...         for offset in wn._lemma_pos_offset_map[form].get(p, []):
...             syn.append(wn.synset_from_pos_and_offset(p, offset))
...
>>> syn
[Synset('drink.n.01'), Synset('drink.n.02'), Synset('beverage.n.01'), Synset('drink.n.04'), Synset('swallow.n.02'), Synset('drink.v.01'), Synset('drink.v.02'), Synset('toast.v.02'), Synset('drink_in.v.01'), Synset('drink.v.05')]
>>> # You can verify this against what synsets() returns
>>> wn.synsets("drink")
[Synset('drink.n.01'), Synset('drink.n.02'), Synset('beverage.n.01'), Synset('drink.n.04'), Synset('swallow.n.02'), Synset('drink.v.01'), Synset('drink.v.02'), Synset('toast.v.02'), Synset('drink_in.v.01'), Synset('drink.v.05')]
>>> 

I hope the updated answer helps!

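I'm a bit late, but I was also looking for the ordering, and I found this on their web page: synsets are ordered by estimated frequency of use. The official site says: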

"-syns (n | v | a | r )
Display synonyms and immediate hypernyms of synsets containing searchstr. 
Synsets are ordered by estimated frequency of use. [...]"

Source: https://wordnet.princeton.edu/documentation/wn1wn
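If the list returned by synsets() does follow that frequency ordering, one simple way to keep only the "most important" synonyms is to take the lemma names of the first few synsets while preserving their order (the set() in the question discards it). This is a minimal sketch under that assumption: top_synonyms and demo are hypothetical helper names, and the cutoff of two synsets is arbitrary.

```python
def top_synonyms(synsets, max_synsets=2):
    """Collect lemma names from the first `max_synsets` synsets, in first-seen order."""
    seen = {}  # dict keys keep insertion order and drop duplicates
    for synset in synsets[:max_synsets]:
        for name in synset.lemma_names():
            seen.setdefault(name, None)
    return list(seen)


def demo(word="drink"):
    """Run top_synonyms on real data; requires NLTK and the WordNet corpus."""
    from nltk.corpus import wordnet as wn

    return top_synonyms(wn.synsets(word))
```

Calling demo("drink") returns the synonyms of the first two synsets only. Passing wn.synsets(word, pos=wn.VERB) to top_synonyms instead restricts the result to verb senses.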