Different models with gensim Word2Vec in Python
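I am trying to use the word2vec model implemented in the Python library gensim. I have a list of sentences (each sentence is a list of words).
For example, suppose we have: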
sentences=[['first','second','third','fourth']]*n
I built two identical models:
model = gensim.models.Word2Vec(sentences, min_count=1,size=2)
model2=gensim.models.Word2Vec(sentences, min_count=1,size=2)
I noticed that the models are sometimes identical and sometimes different, depending on the value of n.
For example, with n=100 I get
print(model['first']==model2['first'])
True
Whereas for n=1000:
print(model['first']==model2['first'])
False
How is this possible?
Thanks a lot!
According to the gensim documentation, there is some randomization when you run Word2Vec:
seed = for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread, to eliminate ordering jitter from OS thread scheduling.
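So if you want reproducible results, you need to set the seed. As the note above says, a fully deterministic run also requires limiting training to a single worker thread, since the seed alone does not remove thread-ordering jitter: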
In [1]: import gensim
In [2]: sentences=[['first','second','third','fourth']]*1000
In [3]: model1 = gensim.models.Word2Vec(sentences, min_count = 1, size = 2)
In [4]: model2 = gensim.models.Word2Vec(sentences, min_count = 1, size = 2)
In [5]: print(all(model1['first']==model2['first']))
False
In [6]: model3 = gensim.models.Word2Vec(sentences, min_count = 1, size = 2, seed = 1234)
In [7]: model4 = gensim.models.Word2Vec(sentences, min_count = 1, size = 2, seed = 1234)
In [11]: print(all(model3['first']==model4['first']))
True
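To illustrate the sentence "initial vectors for each word are seeded with a hash of the concatenation of word + str(seed)", here is a minimal standard-library sketch. It is not gensim's actual implementation: the helper name `seeded_vector`, the use of SHA-256, and the scaling of the values are all assumptions made for illustration. It only shows why the same word and seed give the same starting vector, while a different seed gives a different one.

```python
import hashlib
import random


def seeded_vector(word, seed, size=2):
    """Hypothetical helper: build an initial vector from hash(word + str(seed))."""
    # Hash the concatenation of the word and the seed (SHA-256 is an assumption).
    digest = hashlib.sha256((word + str(seed)).encode("utf-8")).digest()
    # Use the first 8 bytes of the digest to seed a private random generator.
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    # Draw small values centred on zero, one per vector dimension.
    return [(rng.random() - 0.5) / size for _ in range(size)]


a = seeded_vector("first", 1234)
b = seeded_vector("first", 1234)
c = seeded_vector("first", 4321)

print(a == b)  # same word and seed: identical initial vectors
print(a == c)  # different seed: a different initial vector is expected
```

Identical initial vectors are only the starting point. Training with several worker threads can still apply updates in a different order on each run, which is the ordering jitter the documentation refers to.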