Different models with gensim Word2Vec in Python

Question

I am trying to apply the word2vec model implemented in the Python library gensim. I have a list of sentences (each sentence is a list of words).

For instance, suppose we have:

sentences=[['first','second','third','fourth']]*n

and I train two identical models:

model = gensim.models.Word2Vec(sentences, min_count=1, size=2)
model2 = gensim.models.Word2Vec(sentences, min_count=1, size=2)

I noticed that the models are sometimes identical and sometimes different, depending on the value of n.

For instance, with n=100 I get

print(model['first']==model2['first'])
True

whereas for n=1000:

print(model['first']==model2['first'])
False

How is this possible?

Thank you very much!

Answer 1

Looking at the gensim documentation, there is some randomization when you run Word2Vec:

seed = for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread, to eliminate ordering jitter from OS thread scheduling.
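To make the quoted sentence concrete, here is a minimal sketch of what "seeding each word's initial vector with a hash of word + str(seed)" can look like. It is not gensim's code: `seeded_vector` is a hypothetical helper, `zlib.crc32` stands in for the hash function, and the scaling of the values is an assumption chosen for illustration.

```python
import random
import zlib

def seeded_vector(word, seed, size):
    """Illustrative only: derive a word's initial vector from hash(word + str(seed))."""
    # zlib.crc32 is a stand-in for the hash function; it is stable across runs.
    h = zlib.crc32((word + str(seed)).encode("utf-8"))
    rng = random.Random(h)  # private generator seeded by the hash
    # Small values centred on zero, scaled by the vector size.
    return [(rng.random() - 0.5) / size for _ in range(size)]

v1 = seeded_vector("first", 1234, 2)
v2 = seeded_vector("first", 1234, 2)
v3 = seeded_vector("first", 1, 2)
print(v1 == v2)  # same word and same seed: same starting vector
print(v1 == v3)  # same word, different seed: compared for contrast
```

The point of the sketch is that the starting vectors depend only on the word and the seed, so any remaining difference between two runs with the same seed has to come from training itself, for example from the thread scheduling mentioned in the quote.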

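So if you want reproducible results, you need to set the seed (and, per the note quoted above, limit the model to a single worker thread for a fully deterministic run):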

In [1]: import gensim

In [2]: sentences=[['first','second','third','fourth']]*1000

In [3]: model1 = gensim.models.Word2Vec(sentences, min_count = 1, size = 2)

In [4]: model2 = gensim.models.Word2Vec(sentences, min_count = 1, size = 2)

In [5]: print(all(model1['first']==model2['first']))
False

In [6]: model3 = gensim.models.Word2Vec(sentences, min_count = 1, size = 2, seed = 1234)

In [7]: model4 = gensim.models.Word2Vec(sentences, min_count = 1, size = 2, seed = 1234)

In [11]: print(all(model3['first']==model4['first']))
True
