CBOW vs. skip-gram: why invert context and target words?

On this page, it says:

[...] skip-gram inverts contexts and targets, and tries to predict each context word from its target word [...]

However, looking at the training dataset it generates, the contents of the X and Y pairs look interchangeable, as in these two (X, Y) pairs:

(quick, brown), (brown, quick)

So, if context and target are ultimately the same thing, why make such a distinction between them?

Also, while doing Udacity's Deep Learning course exercise on word2vec, I wonder why they seem to draw such a sharp distinction between the two approaches in this problem:

An alternative to skip-gram is another Word2Vec model called CBOW (Continuous Bag of Words). In the CBOW model, instead of predicting a context word from a word vector, you predict a word from the sum of all the word vectors in its context. Implement and evaluate a CBOW model trained on the text8 dataset.

Wouldn't this produce the same results?
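For concreteness, here is a minimal sketch (my own illustration, with a context window of 1 and hypothetical helper names) of the training examples each model would generate from a toy sentence:

```python
def skipgram_pairs(tokens, window=1):
    """Skip-gram: (input = center word, label = one context word), one pair per context word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=1):
    """CBOW: (input = all context words together, label = center word), one example per position."""
    examples = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        if context:
            examples.append((context, center))
    return examples

tokens = "the quick brown fox".split()
print(skipgram_pairs(tokens))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
print(cbow_examples(tokens))
# [(['quick'], 'the'), (['the', 'brown'], 'quick'),
#  (['quick', 'fox'], 'brown'), (['brown'], 'fox')]
```

With a window of 1 the skip-gram pairs do look mirror-symmetric, but the CBOW examples group the whole context into a single input, which is where the two setups start to differ once the window grows.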

It has to do with exactly what you are calculating at any given point. The difference becomes more apparent once you look at models that incorporate a larger context into each probability calculation.

In skip-gram, you calculate the context word(s) from the word at the current position in the sentence; you are "skipping" the current word (and potentially a bit of the context) in the calculation. The result can be more than one word (but not if your context window is just one word long).

In CBOW, you calculate the current word from the context word(s), so you will only ever get one word as a result.
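To make the direction of the computation concrete, here is a rough numpy sketch (my own, not from either course) of what each model scores at a single position, assuming one input embedding matrix E and one output matrix W:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

vocab_size, dim = 10, 4
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, dim))   # input (word) embeddings
W = rng.normal(size=(vocab_size, dim))   # output embeddings

center_id = 3
context_ids = [1, 2, 4, 5]

# Skip-gram: one distribution over the vocabulary, conditioned on the center word;
# it is used once per context slot, so the "result" covers several context words.
skipgram_probs = softmax(W @ E[center_id])     # P(any word | center)
# training pushes skipgram_probs[c] up for each c in context_ids

# CBOW: a single distribution conditioned on the summed (or averaged) context
# vectors -- exactly one predicted word per position.
h = E[context_ids].sum(axis=0)                 # combine the context
cbow_probs = softmax(W @ h)                    # P(center word | context)
# training pushes cbow_probs[center_id] up
```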

Here is my oversimplified and rather naive understanding of the difference:

As we know, CBOW learns to predict a word from its context, i.e. it maximizes the probability of the target word given the context. And this happens to be a problem for rare words. For example, given the context yesterday was a really [...] day, the CBOW model will tell you that the word is most likely beautiful or nice. A word like delightful gets much less attention from the model, because the model is designed to predict the most probable word. Rare words get smoothed over by the many examples containing more frequent words.

The skip-gram model, on the other hand, is designed to predict the context. Given the word delightful, it must understand it and tell us that the context is very likely yesterday was really [...] day or some other relevant context. With skip-gram, the word delightful does not try to compete with the word beautiful; instead, delightful+context pairs are treated as new observations.

Update

Thanks to @0xF for sharing this article.

According to Mikolov:

Skip-gram: works well with small amount of the training data, represents well even rare words or phrases.

CBOW: several times faster to train than the skip-gram, slightly better accuracy for the frequent words

Another addition on the topic can be found here:

In the "skip-gram" mode alternative to "CBOW", rather than averaging the context words, each is used as a pairwise training example. That is, in place of one CBOW example such as [predict 'ate' from average('The', 'cat', 'the', 'mouse')], the network is presented with four skip-gram examples [predict 'ate' from 'The'], [predict 'ate' from 'cat'], [predict 'ate' from 'the'], [predict 'ate' from 'mouse']. (The same random window-reduction occurs, so half the time that would just be two examples, of the nearest words.)
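Here is a small sketch (my own, assuming the sentence "The cat ate the mouse", center word 'ate', and a window of 2 with no random window reduction) that reproduces the examples in that quote:

```python
sentence = ["The", "cat", "ate", "the", "mouse"]
center_idx, window = 2, 2

context = [sentence[j]
           for j in range(center_idx - window, center_idx + window + 1)
           if j != center_idx]

# One CBOW example: predict the center word from the combined context.
cbow_example = (context, sentence[center_idx])
print(cbow_example)       # (['The', 'cat', 'the', 'mouse'], 'ate')

# Four skip-gram examples: one (context word, center word) pair each,
# matching the quote's "predict 'ate' from ..." wording.
skipgram_examples = [(w, sentence[center_idx]) for w in context]
print(skipgram_examples)  # [('The', 'ate'), ('cat', 'ate'), ('the', 'ate'), ('mouse', 'ate')]
```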

In the Deep Learning specialization on Coursera (https://www.coursera.org/learn/nlp-sequence-models?specialization=deep-learning) you can see that Andrew Ng does not switch the context/target concepts. That means the target word is always treated as the word to be predicted, whether in CBOW or skip-gram.