Ensure gensim generates the same Word2Vec model for different runs on the same data

In LDA model generates different topics everytime i train on the same corpus, it was shown that by setting np.random.seed(0), the LDA model will always be initialized and trained in exactly the same way.

Is the same true for the Word2Vec models of gensim? By setting the random seed to a constant, would different runs on the same dataset produce the same model?
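To be concrete, by "setting the random seed" I mean pinning the constructor's seed parameter to a fixed value, e.g.:

from gensim.models import Word2Vec

# Pin the seed explicitly; the question is whether repeated runs now match.
model = Word2Vec(sentences, size=10, window=5, min_count=5, seed=0)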

But strangely enough, it is already giving me the same vectors across different instances:

>>> from nltk.corpus import brown
>>> from gensim.models import Word2Vec
>>> sentences = brown.sents()[:100]
>>> word0 = sentences[0][0]
>>> model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.04985042,  0.02882229, -0.03625415, -0.03165979,  0.06049283,
        0.01207791,  0.04722737,  0.01984878, -0.03026265,  0.04485954], dtype=float32)
>>> model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.04985042,  0.02882229, -0.03625415, -0.03165979,  0.06049283,
        0.01207791,  0.04722737,  0.01984878, -0.03026265,  0.04485954], dtype=float32)
>>> model = Word2Vec(sentences, size=20, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.02596745,  0.01475067, -0.01839622, -0.01587902,  0.03079717,
        0.00586761,  0.02367715,  0.00930568, -0.01521437,  0.02213679,
        0.01043982, -0.00625582,  0.00173071, -0.00235749,  0.01309298,
        0.00710233, -0.02270884, -0.01477827,  0.01166443,  0.00283862], dtype=float32)
>>> model = Word2Vec(sentences, size=20, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.02596745,  0.01475067, -0.01839622, -0.01587902,  0.03079717,
        0.00586761,  0.02367715,  0.00930568, -0.01521437,  0.02213679,
        0.01043982, -0.00625582,  0.00173071, -0.00235749,  0.01309298,
        0.00710233, -0.02270884, -0.01477827,  0.01166443,  0.00283862], dtype=float32)
>>> exit()
alvas@ubi:~$ python
Python 2.7.11 (default, Dec 15 2015, 16:46:19) 
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk.corpus import brown
>>> from gensim.models import Word2Vec
>>> sentences = brown.sents()[:100]
>>> model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4)
>>> word0 = sentences[0][0]
>>> model[word0]
array([ 0.04985042,  0.02882229, -0.03625415, -0.03165979,  0.06049283,
        0.01207791,  0.04722737,  0.01984878, -0.03026265,  0.04485954], dtype=float32)
>>> model = Word2Vec(sentences, size=20, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.02596745,  0.01475067, -0.01839622, -0.01587902,  0.03079717,
        0.00586761,  0.02367715,  0.00930568, -0.01521437,  0.02213679,
        0.01043982, -0.00625582,  0.00173071, -0.00235749,  0.01309298,
        0.00710233, -0.02270884, -0.01477827,  0.01166443,  0.00283862], dtype=float32)

So is the default random seed fixed? If so, what is the default seed value? Or is it because I am testing on a small dataset?

If it is true that the random seed is fixed and that different runs on the same data return the same vectors, a link to the canonical code or documentation would be much appreciated.

Yes, the default random seed is fixed to 1, as described by the author in https://radimrehurek.com/gensim/models/word2vec.html. The vectors for each word are initialised using a hash of the concatenation of word + str(seed).
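Schematically, the per-word initialisation works along these lines (a simplified sketch paraphrasing gensim's word2vec.py; the exact code differs between versions):

import numpy as np

def seeded_vector(seed_string, vector_size, hashfxn=hash):
    # Seed a private RNG from a hash of word + str(seed), so the same word
    # and seed always yield the same starting vector -- provided hashfxn
    # itself is stable across runs.
    once = np.random.RandomState(hashfxn(seed_string) & 0xffffffff)
    return (once.rand(vector_size) - 0.5) / vector_size

# e.g. the initial vector for the word "The" under the default seed of 1:
vec = seeded_vector("The" + str(1), vector_size=10)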

However, the hash function used is Python's basic built-in hash function, which can produce different results if two machines differ in

  • 32-bit vs 64-bit, reference
  • Python versions, reference
  • different operating systems / interpreters, reference1, reference2

The above list is not exhaustive. Does it cover your question?

EDIT

If you want to guarantee consistency, you can provide your own hash function as the hashfxn argument to Word2Vec.

A very simple (and bad) example would be:

def hash(astring):
    # Deterministic toy hash: every word is keyed by its first character only.
    return ord(astring[0])

model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4, hashfxn=hash)

print(model[sentences[0][0]])
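A more practical deterministic choice (my own suggestion, not from the original answer) is to derive the hash from a cryptographic digest, which is stable across platforms, Python versions and hash-randomised interpreters:

import hashlib

def stable_hash(astring):
    # md5 of the UTF-8 bytes is identical everywhere, unlike the built-in hash().
    return int(hashlib.md5(astring.encode('utf-8')).hexdigest(), 16)

model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4, hashfxn=stable_hash)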

According to Gensim's documentation, to perform a fully deterministically-reproducible run, you must also limit the model to a single worker thread, in order to eliminate ordering jitter from OS thread scheduling.

A simple parameter edit to your code should do the trick.

model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=1)

Just a remark on randomness.

If you are working with gensim's Word2Vec model and you are on Python >= 3.3, keep in mind that hash randomisation is on by default. If you want consistency between two executions, make sure to set the PYTHONHASHSEED environment variable, e.g. by running your code as PYTHONHASHSEED=123 python3 mycode.py. That way, the next time you generate the model (with the same hash seed) it will be identical to the previously generated one, provided all the other randomness-control steps are followed as mentioned above: a fixed random state and a single worker. See gensim's W2V source and the Python docs for details.
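You can see the effect of hash randomisation without gensim at all. Run the following from a shell (the printed numbers will differ on your machine):

$ python3 -c "print(hash('gensim'))"
$ python3 -c "print(hash('gensim'))"                     # a different number on each launch
$ PYTHONHASHSEED=123 python3 -c "print(hash('gensim'))"
$ PYTHONHASHSEED=123 python3 -c "print(hash('gensim'))"  # now both launches agree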

For a fully deterministically-reproducible run, besides defining a seed, you must also limit the model to a single worker thread (workers=1), to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility across interpreter launches also requires using the PYTHONHASHSEED environment variable to control hash randomisation.)

import gensim

def hash(astring):
    # Toy deterministic hash; see the caveats above.
    return ord(astring[0])

# texts is your corpus of tokenised sentences
model = gensim.models.Word2Vec(texts, workers=1, seed=1, hashfxn=hash)
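As a quick sanity check (my addition; it assumes sentences is the Brown sample from the question and reuses the hash function above), training twice with these settings should yield identical vectors:

import numpy as np
from gensim.models import Word2Vec

params = dict(size=10, window=5, min_count=5, workers=1, seed=1, hashfxn=hash)
model_a = Word2Vec(sentences, **params)
model_b = Word2Vec(sentences, **params)

word0 = sentences[0][0]
assert np.allclose(model_a[word0], model_b[word0])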

Your problem is indeed the small dataset: only 100 sentences.

Note what the Gensim FAQ says:

[Because randomness is part of Word2Vec and similar models], it is to be expected that models vary from run to run, even trained on the same data. There's no single "right place" for any word-vector or doc-vector to wind up: just positions that are at progressively more-useful distances & directions from other vectors co-trained inside the same model. [...]

Suitable training parameters should yield models that are roughly as useful, from run-to-run, as each other. Testing and evaluation processes should be tolerant of any shifts in vector positions, and of small "jitter" in the overall utility of models, that arises from the inherent algorithm randomness. (If the observed quality from run-to-run varies a lot, there may be other problems: too little data, poorly-tuned parameters, or errors/weaknesses in the evaluation method.)

You can try to force determinism[.] But [...] you'd be obscuring the inherent randomness/approximateness of the underlying algorithms[.] It's better to tolerate a little jitter, and use excessive jitter as an indicator of problems elsewhere in the data or model setup – rather than impose a superficial determinism.
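To see the jitter the FAQ describes, train twice with several workers and no determinism controls on a larger slice of Brown (a sketch; with only 100 sentences the two runs may still coincide, which is presumably what happened in the question):

import numpy as np
from nltk.corpus import brown
from gensim.models import Word2Vec

sentences = brown.sents()[:10000]
m1 = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4)
m2 = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4)

word0 = sentences[0][0]
print(np.allclose(m1[word0], m2[word0]))  # typically False: the runs diverge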

Running your code as PYTHONHASHSEED=0 python yourcode.py should solve your problem.