'word not in the vocabulary' when evaluating similarity using the Gensim Word2Vec.most_similar method

Using the method gensim.models.Word2Vec.most_similar, I get the top N most similar words.

I trained a model with a list of sentences like this:
list_of_list = [["i like going to the beach"],
                ["the war is over"], 
                ["we are all made of stars"],  
                         ...
                ["i don't know what to do"]] 
model = gensim.models.Word2Vec(list_of_list, size=100, window=longest_list, min_count=2)

I want to evaluate phrase similarity.

For example, if I run

suggestions = model.most_similar("I don't know what to do", topn=10)       

it works fine.

But if I give it a sub-query such as "to the beach" or "what to do", it returns an error message because the sub-phrase is not in the vocabulary:

 "word 'to the beach' not in vocabulary"

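How can I solve this without retraining the model? How can the model find the most similar phrases for a new phrase, rather than only for a sub-phrase it has already seen?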

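It looks like you are not training the Word2Vec model correctly. Each sentence should be a list of words, not a list containing a single string. So if you change it to: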

list_of_list = [["i like going to the beach"],
                ["the war is over"], 
                ["we are all made of stars"],  
                         ...
                ["i don't know what to do"]]

list_for_training = [sent[0].split() for sent in list_of_list]

and pass list_for_training as the first argument to the Word2Vec constructor, the model will learn vectors for individual words.
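To see why the sub-phrase lookup fails, here is a minimal sketch of the vocabulary each input shape would produce. The build_vocab helper is hypothetical: it uses collections.Counter as a stand-in for the model's vocabulary scan and is not gensim's actual implementation. The elided sentences from the question are left out.

```python
from collections import Counter

# Sentences in the shape used in the question: each inner list holds ONE string.
list_of_list = [["i like going to the beach"],
                ["the war is over"],
                ["we are all made of stars"],
                ["i don't know what to do"]]

def build_vocab(sentences, min_count=1):
    """Hypothetical stand-in for a vocabulary scan: count every item of every sentence."""
    counts = Counter(token for sent in sentences for token in sent)
    return {token for token, n in counts.items() if n >= min_count}

# Untokenized input: every whole sentence is treated as a single "word".
vocab_raw = build_vocab(list_of_list)

# Tokenized input: split each string on whitespace first.
list_for_training = [sent[0].split() for sent in list_of_list]
vocab_tokens = build_vocab(list_for_training)

print("to the beach" in vocab_raw)             # False: only whole sentences are known
print("i don't know what to do" in vocab_raw)  # True: the full sentence is one "word"
print("beach" in vocab_tokens)                 # True: individual words are known
```

Note that a min_count of 2, as in the question, would also drop every token that occurs only once in the corpus, so on a small corpus most words would still be missing from the vocabulary.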

Likewise, when calling the most_similar method, pass a list of strings instead of a single string:

# Lowercase the query so that "I" matches the lowercase training tokens.
suggestions = model.most_similar("I don't know what to do".lower().split(), topn=10)

suggestions = model.most_similar("to the beach".split(), topn=10)
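
Even with a correctly tokenized model, any query word that was never seen in training (or was dropped by min_count) still raises the same error. One possible guard, sketched below, is to filter the query against the vocabulary first. The known_tokens helper and the small vocab set are hypothetical; with a trained model the vocabulary would be model.wv.key_to_index in gensim 4.x or model.wv.vocab in gensim 3.x.

```python
def known_tokens(query, vocab):
    """Lowercase and split a query, keeping only tokens present in `vocab`."""
    tokens = query.lower().split()
    return [t for t in tokens if t in vocab]

# Hypothetical vocabulary for illustration. With a trained model this would be
# model.wv.key_to_index (gensim 4.x) or model.wv.vocab (gensim 3.x).
vocab = {"i", "like", "going", "to", "the", "beach", "don't", "know", "what", "do"}

query_tokens = known_tokens("To the seaside", vocab)
print(query_tokens)  # ['to', 'the']  ("seaside" is not in the vocabulary)

# Only query the model when at least one known token is left, e.g.:
# if query_tokens:
#     suggestions = model.wv.most_similar(positive=query_tokens, topn=10)
```

Keep in mind that most_similar called with a list of words returns individual words close to the combined vector of those words, not whole phrases. In gensim 4.x the method is only available on model.wv, not on the model itself.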