Evaluating Word2Vec model by finding linear algebraic structure of words

Question

I have built a Word2Vec model using the gensim library in Python. I want to evaluate my word embeddings as follows:

If A is related to B and C is related to D, then A-C+B should be equal to D. For example, embedding vector arithmetic of "India"-"Rupee"+"Japan" should be equal to the embedding of "Yen".

I have tried gensim's built-in functions such as predict_output_word and most_similar, but could not get the desired results.

new_model.predict_output_word(['india','rupee','japan'],topn=10)
new_model.most_similar(positive=['india', 'rupee'], negative=['japan'])

Please help me evaluate my model according to the criterion above.

Answer 1

You should use the positive and negative parameters of the most_similar() method the same way they are used inside the accuracy() method:

https://github.com/RaRe-Technologies/gensim/blob/718b1c6bd1a8a98625993d73b83d98baf385752d/gensim/models/keyedvectors.py#L697

Specifically, if you have an analogy of the form "A is to B as C is to [expected]", you should look at:

results = model.most_similar(positive=[word_b, word_c], negative=[word_a])

Or, in your example:

results = model.most_similar(positive=['rupee', 'japan'], negative=['india'])
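To see why the word order matters, here is a minimal pure-Python sketch of the arithmetic behind such a query: unit-normalize each vector, add the positive words, subtract the negative ones, then rank the remaining words by cosine similarity. The 4-dimensional vectors and the helper name toy_most_similar are invented for illustration; this is not gensim's implementation and it does not use a trained model.

```python
import math

# Toy embeddings, made up for this example only.
# Axes: (india, japan, usa, currency).
TOY_VECTORS = {
    "india":  [1.0, 0.0, 0.0, 0.0],
    "japan":  [0.0, 1.0, 0.0, 0.0],
    "usa":    [0.0, 0.0, 1.0, 0.0],
    "rupee":  [1.0, 0.0, 0.0, 1.0],
    "yen":    [0.0, 1.0, 0.0, 1.0],
    "dollar": [0.0, 0.0, 1.0, 1.0],
}


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))


def toy_most_similar(vectors, positive, negative, topn=3):
    """Hypothetical helper: rank words by cosine similarity to
    sum(positive) - sum(negative), skipping the query words themselves."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    excluded = set(positive) | set(negative)
    scored = [
        (word, cosine(query, vec))
        for word, vec in vectors.items()
        if word not in excluded
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# "india is to rupee as japan is to ?"  ->  rupee - india + japan
print(toy_most_similar(TOY_VECTORS, positive=["rupee", "japan"], negative=["india"]))

# The ordering from the question computes india + rupee - japan instead.
print(toy_most_similar(TOY_VECTORS, positive=["india", "rupee"], negative=["japan"]))
```

With these toy vectors the first query ranks "yen" at the top, while the ordering from the question does not. Note also that in recent gensim releases, queries like this are made on the model's keyed vectors (model.wv) rather than on the model object itself.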

Tags: nlp, word2vec, word-embedding