Different models with gensim Word2Vec on python

I am trying to apply the word2vec model implemented in the Python library gensim. I have a list of sentences (each sentence is a list of words).

For example, take:

sentences=[['first','second','third','fourth']]*n

I built two identical models:

model = gensim.models.Word2Vec(sentences, min_count=1, size=2)
model2 = gensim.models.Word2Vec(sentences, min_count=1, size=2)

I noticed that the two models are sometimes identical and sometimes different, depending on the value of n.

For example, with n=100 I get:

print(model['first']==model2['first'])
True

Whereas for n=1000:

print(model['first']==model2['first'])
False

How is this possible?

Thank you very much!

According to the gensim documentation, there is some randomization involved when you run Word2Vec:

seed = for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread, to eliminate ordering jitter from OS thread scheduling.
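To make the quoted sentence more concrete, here is a minimal sketch of what "seeding each word's initial vector with a hash of word + str(seed)" could look like. It is only an illustration and not gensim's actual code: the helper name `seeded_vector`, the choice of `zlib.crc32` as the hash, and the scaling of the values are all assumptions.

```python
import random
import zlib


def seeded_vector(word, seed, size=2):
    """Hypothetical helper: derive a word's initial vector from word + str(seed)."""
    # Hash the concatenation of the word and the seed into an integer.
    # zlib.crc32 is used because, unlike the built-in hash() on strings,
    # it returns the same value in every Python process.
    word_seed = zlib.crc32((word + str(seed)).encode("utf-8"))
    rng = random.Random(word_seed)
    # Small random values centred on zero, scaled down by the vector size.
    return [(rng.random() - 0.5) / size for _ in range(size)]


# Same word and same seed: the starting vectors are compared for equality.
print(seeded_vector("first", 1234) == seeded_vector("first", 1234))
# Same word, different seed: the starting vectors are compared again.
print(seeded_vector("first", 1234) == seeded_vector("first", 4321))
```

In a scheme like this, the seed fixes only the starting point. The order in which training updates are applied can still vary between runs when several worker threads are used, which is what the second half of the quoted note is about.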

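So, if you want reproducible results, you need to set the seed (and, as the note above says, a fully deterministic run also requires limiting the model to a single worker thread, e.g. workers=1):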

In [1]: import gensim

In [2]: sentences=[['first','second','third','fourth']]*1000

In [3]: model1 = gensim.models.Word2Vec(sentences, min_count = 1, size = 2)

In [4]: model2 = gensim.models.Word2Vec(sentences, min_count = 1, size = 2)

In [5]: print(all(model1['first']==model2['first']))
False

In [6]: model3 = gensim.models.Word2Vec(sentences, min_count = 1, size = 2, seed = 1234)

In [7]: model4 = gensim.models.Word2Vec(sentences, min_count = 1, size = 2, seed = 1234)

In [11]: print(all(model3['first']==model4['first']))
True