'word not in the vocabulary' when evaluating similarity using Gensim Word2Vec.most_similar method

Question

I have used gensim.models.Word2Vec.most_similar to get the top N most similar words.

I trained a model using a list of sentences like this:

list_of_list = [["i like going to the beach"],
                ["the war is over"], 
                ["we are all made of stars"],  
                         ...
                ["i don't know what to do"]] 
model = gensim.models.Word2Vec(list_of_list, size=100, window=longest_list, min_count=2)

suggestions = model.most_similar("I don't know what to do", topn=10)

I want to evaluate phrase similarity.

For example, if I run

suggestions = model.most_similar("I don't know what to do", topn=10)

it works fine.

But if I give a subquery such as "to the beach" or "what to do", it returns an error message because the subphrase is not in the vocabulary:

 "word 'to the beach' not in vocabulary"

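How can this be solved without retraining the model? How can the model identify the most similar phrases for a new phrase, rather than for a subphrase?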

Answer 1

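It looks like you did not train the Word2Vec model correctly. Each sentence should be a list of words, not a list containing a single string. So if you change it to: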

list_of_list = [["i like going to the beach"],
                ["the war is over"], 
                ["we are all made of stars"],  
                         ...
                ["i don't know what to do"]]

list_for_training = [sent[0].split() for sent in list_of_list]

and use list_for_training as the first argument to the Word2Vec constructor.
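
To see why the original input goes wrong, here is a minimal stdlib-only sketch that counts the tokens a trainer would see in each case. Word2Vec treats every element of an inner list as one token, so a sentence given as a single string contributes one "word" that is the whole sentence. The helper name token_counts and the shortened sample data are illustrative assumptions, not part of Gensim.

```python
from collections import Counter

def token_counts(sentences):
    """Count tokens as seen by a trainer that iterates over the inner lists."""
    return Counter(token for sentence in sentences for token in sentence)

# Shortened sample data in the same shape as the question's input.
list_of_list = [["i like going to the beach"],
                ["the war is over"],
                ["i don't know what to do"]]

# Original input: each "sentence" holds exactly one token, the full string.
raw_counts = token_counts(list_of_list)

# Tokenized input: each sentence becomes a list of words.
list_for_training = [sent[0].split() for sent in list_of_list]
word_counts = token_counts(list_for_training)

print("to the beach" in raw_counts)  # False: only full sentences are tokens
print(word_counts["the"])            # 2: counted across sentences
```

With the untokenized input the only vocabulary candidates are full sentences, which is why a subphrase such as "to the beach" is reported as not in the vocabulary. Note also that min_count=2 discards any token seen fewer than twice, so rare words can still be missing after tokenizing.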

Likewise, when calling the most_similar method, pass a list of strings rather than a single string:

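# lowercase the query first: the training sentences are lowercase, so "I" would not be in the vocabulary
suggestions = model.most_similar("I don't know what to do".lower().split(), topn=10)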

or

suggestions = model.most_similar("to the beach".split(), topn=10)
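
Even with a correctly trained model, a query token can still be missing from the vocabulary, for example because of capitalization or because min_count dropped a rare word. Below is a minimal sketch of filtering a query before calling most_similar. The helper in_vocab_tokens is hypothetical, and vocab is a stand-in for the model's vocabulary (assumed to be model.wv.vocab in Gensim 3.x or model.wv.key_to_index in Gensim 4.x).

```python
def in_vocab_tokens(phrase, vocab):
    """Lowercase and split a phrase, keeping only the tokens found in vocab."""
    return [tok for tok in phrase.lower().split() if tok in vocab]

# Stand-in vocabulary for illustration only; with a real model this would be
# model.wv.vocab (Gensim 3.x) or model.wv.key_to_index (Gensim 4.x).
vocab = {"i", "to", "the", "beach", "what", "do"}

query = in_vocab_tokens("To the beach today", vocab)
print(query)  # ['to', 'the', 'beach']

# With a trained model (not executed here):
# if query:
#     suggestions = model.wv.most_similar(positive=query, topn=10)
```

Skipping unknown tokens avoids the "not in vocabulary" error without retraining, at the cost of ignoring those words. Keep in mind that most_similar called with a list of words returns individual vocabulary words close to the combined vector, not whole phrases.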

python

nlp

similarity

gensim