What is the stochastic aspect of Word2Vec?

I am using Gensim to vectorize the words in several different corpora, and the results are making me rethink how Word2Vec works. My understanding was that Word2Vec is deterministic, and that a word's position in vector space would not change from one training run to the next. If "My cat is running" and "your dog can't be running" are the two sentences in the corpus, then the value of "running" (or its stem) seems necessarily fixed.

However, I have found that the value does vary from model to model, and that words keep shifting position in vector space every time I retrain. The differences are not always hugely meaningful, but they do indicate that some random process is at work. What am I missing here?
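A minimal sketch of what I mean, assuming gensim 4.x (where the dimensionality parameter is vector_size) and a made-up toy corpus; training twice on the same sentences can give different raw vectors for "running", and across separate runs of the script it typically does:

from gensim.models import Word2Vec

toy_corpus = [
    ["my", "cat", "is", "running"],
    ["your", "dog", "can't", "be", "running"],
]

# Train two models on exactly the same data.
m1 = Word2Vec(sentences=toy_corpus, vector_size=10, min_count=1, epochs=50)
m2 = Word2Vec(sentences=toy_corpus, vector_size=10, min_count=1, epochs=50)

# The raw coordinates of "running" can differ between the two models.
print(m1.wv["running"][:3])
print(m2.wv["running"][:3])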

While I do not know the implementation details of Word2Vec in gensim, I do know that, in general, Word2Vec is trained by a simple neural network whose first layer is an embedding layer. The weight matrix of this embedding layer contains the word vectors we are interested in.

That said, it is also quite common to initialize a neural network's weights randomly. So there is the origin of your randomness.
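To make that concrete, here is a tiny NumPy sketch of such a randomly initialized embedding matrix. This is not gensim's actual initialization code; the uniform range is the one used in the original word2vec C implementation, and the sizes are arbitrary:

import numpy as np

vocab_size, dim = 10_000, 100

# One row per word: these are the starting word vectors, and they are a fresh
# random draw on every run unless the RNG seed and everything else is fixed.
rng = np.random.default_rng()
embedding = rng.uniform(-0.5 / dim, 0.5 / dim, size=(vocab_size, dim))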

But how different can the results really be, given different (random) starting conditions?

A well-trained model will assign similar vectors to words with similar meaning. This similarity is measured by the cosine of the angle between the two vectors. Mathematically, if v and w are the vectors of two very similar words, then

import numpy as np
np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))  # the cosine of the angle between v and w

will be close to 1.
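In gensim you do not have to compute this by hand; a quick sketch of querying a trained model called model (method names from gensim 4.x, assuming the query words are in the vocabulary):

sim = model.wv.similarity("cat", "dog")                 # cosine similarity of two word vectors
neighbours = model.wv.most_similar("running", topn=5)   # nearest words by cosine similarity
print(sim, neighbours)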

Moreover, it also lets you do vector arithmetic, like the famous

king - man + woman = queen

For illustration, imagine two-dimensional vectors. Would these arithmetic properties be lost if you, for example, rotated everything around the origin by some angle? With a little mathematical background, I can assure you: no, they would not!
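A short NumPy sketch of that rotation argument, with 2-D toy vectors made up purely for illustration: rotating every vector by the same angle changes the coordinates but leaves every cosine, and hence the analogy arithmetic, intact.

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 2-D "word vectors", invented only to illustrate the point.
king, man, woman = np.array([2.0, 1.0]), np.array([1.0, 0.5]), np.array([1.0, 1.5])
queen_like = king - man + woman

theta = 0.7  # rotate everything by the same arbitrary angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# The cosine between the analogy result and "king" is unchanged by the rotation,
# even though the raw coordinates of every vector have changed.
print(cosine(queen_like, king))
print(cosine(R @ queen_like, R @ king))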

So, your assumption

If "My cat is running" and "your dog can't be running" are the two sentences in the corpus, then the value of "running" (or its stem) seems necessarily fixed.

is wrong. The value of "running" is not fixed at all. What is (in a sense) fixed, however, are its similarities (cosines) to and arithmetic relationships with the other words.

This is covered in detail in the Gensim FAQ, which I quote here:

Q11: I've trained my Word2Vec/Doc2Vec/etc model repeatedly using the exact same text corpus, but the vectors are different each time. Is there a bug or have I made a mistake? (*2vec training non-determinism)

Answer: The *2vec models (word2vec, fasttext, doc2vec…) begin with random initialization, then most modes use additional randomization during training. (For example, the training windows are randomly truncated as an efficient way of weighting nearer words higher. The negative examples in the default negative-sampling mode are chosen randomly. And the downsampling of highly-frequent words, as controlled by the sample parameter, is driven by random choices. These behaviors were all defined in the original Word2Vec paper's algorithm description.)

Even when all this randomness comes from a pseudorandom-number-generator that's been seeded to give a reproducible stream of random numbers (which gensim does by default), the usual case of multi-threaded training can further change the exact training-order of text examples, and thus the final model state. (Further, in Python 3.x, the hashing of strings is randomized each re-launch of the Python interpreter - changing the iteration ordering of vocabulary dicts from run to run, and thus making even the same string-of-random-number-draws pick different words in different launches.)

So, it is to be expected that models vary from run to run, even trained on the same data. There's no single "right place" for any word-vector or doc-vector to wind up: just positions that are at progressively more-useful distances & directions from other vectors co-trained inside the same model. (In general, only vectors that were trained together in an interleaved session of contrasting uses become comparable in their coordinates.)

Suitable training parameters should yield models that are roughly as useful, from run-to-run, as each other. Testing and evaluation processes should be tolerant of any shifts in vector positions, and of small "jitter" in the overall utility of models, that arises from the inherent algorithm randomness. (If the observed quality from run-to-run varies a lot, there may be other problems: too little data, poorly-tuned parameters, or errors/weaknesses in the evaluation method.)

You can try to force determinism, by using workers=1 to limit training to a single thread – and, if in Python 3.x, using the PYTHONHASHSEED environment variable to disable its usual string hash randomization. But training will be much slower than with more threads. And, you'd be obscuring the inherent randomness/approximateness of the underlying algorithms, in a way that might make results more fragile and dependent on the luck of a particular setup. It's better to tolerate a little jitter, and use excessive jitter as an indicator of problems elsewhere in the data or model setup – rather than impose a superficial determinism.
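If you do want to try forcing determinism as described above, a rough sketch of the relevant knobs (gensim 4.x parameter names; corpus stands in for your own iterable of tokenized sentences, and the seed value is arbitrary):

# Launch the interpreter with a fixed hash seed, e.g.:
#   PYTHONHASHSEED=0 python train.py
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=corpus,   # your own tokenized sentences
    vector_size=100,
    workers=1,          # single thread: removes thread-ordering nondeterminism
    seed=42,            # fixed RNG seed for initialization and sampling
)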