Word2Vec - adding constraint to vector representation

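I am trying to adapt the pretrained Google News word2vec model to my specific domain. In the domain I am looking at, certain words are known to be similar to each other, so in an ideal world the Word2Vec representations of those words should reflect that. I understand that I can continue training the pretrained model on a corpus of domain-specific data to update the vectors.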

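However, if I know for certain that some words are highly similar and should be close together, is there a way to incorporate that constraint into the word2vec model? Mathematically, I would like to add a term to word2vec's loss function that imposes a penalty when two words I know to be similar are not close to each other in the vector space. Does anyone have advice on how to implement this? Would it require me to unpack the word2vec model, or is there a way to add an extra term to the loss function?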

One way is to take the pretrained Google News word2vec vectors and use this "retrofitting" tool:

Faruqui, Manaal, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. "Retrofitting word vectors to semantic lexicons." arXiv preprint arXiv:1411.4166 (2014). https://arxiv.org/abs/1411.4166

This paper proposes a method for refining vector space representations using relational information from semantic lexicons by encouraging linked words to have similar vector representations, and it makes no assumptions about how the input vectors were constructed.
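
For reference, here is the retrofitting objective as I recall it from the paper (the notation is worth checking against the original). Here \hat{q}_i is the original vector for word i, q_i is its retrofitted vector, E is the set of lexicon edges linking similar words, and \alpha_i, \beta_{ij} are weights. The second term is essentially the penalty the question asks for, except that it is applied as a post-processing step on the finished vectors rather than inside word2vec's training loss:

```latex
\Psi(Q) = \sum_{i=1}^{n} \Big[ \alpha_i \lVert q_i - \hat{q}_i \rVert^2
          + \sum_{(i,j) \in E} \beta_{ij} \lVert q_i - q_j \rVert^2 \Big]

% Setting the derivative with respect to q_i to zero gives the iterative update:
q_i = \frac{\sum_{j : (i,j) \in E} \beta_{ij} \, q_j + \alpha_i \, \hat{q}_i}
           {\sum_{j : (i,j) \in E} \beta_{ij} + \alpha_i}
```

With \alpha_i = 1 and \beta_{ij} = 1 / \mathrm{degree}(i), the update reduces to averaging a word's original vector with the mean of its neighbours' current vectors.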

The code is available at https://github.com/mfaruqui/retrofitting and is straightforward to use (I've personally used it for https://arxiv.org/abs/1607.02802).
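
To give a feel for what retrofitting does, here is a minimal pure-Python sketch of the idea, not the code from the repository above. It assumes the common weighting where each word's original vector counts as much as the mean of its lexicon neighbours, so every update is simply the average of the two. The function names `retrofit` and `cosine`, the toy vectors, and the toy lexicon are all made up for illustration; in practice you would load the Google News vectors and a lexicon of your domain's known-similar words.

```python
import math


def retrofit(vectors, lexicon, n_iters=10):
    """Pull each word's vector toward the vectors of its lexicon neighbours.

    vectors: dict mapping word -> list of floats (the original embeddings)
    lexicon: dict mapping word -> list of words known to be similar to it
    Returns a new dict; the input vectors are left untouched.
    """
    new_vectors = {w: list(v) for w, v in vectors.items()}
    for _ in range(n_iters):
        for word, neighbours in lexicon.items():
            if word not in vectors:
                continue
            # Only neighbours that actually have an embedding can contribute.
            known = [n for n in neighbours if n in vectors and n != word]
            if not known:
                continue
            dim = len(vectors[word])
            # Mean of the neighbours' current (possibly already updated) vectors.
            mean = [sum(new_vectors[n][k] for n in known) / len(known)
                    for k in range(dim)]
            # Average the ORIGINAL vector with the neighbour mean, so the word
            # moves toward its neighbours but stays anchored to where it started.
            new_vectors[word] = [0.5 * (o + m)
                                 for o, m in zip(vectors[word], mean)]
    return new_vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy data: "cat" and "feline" start out orthogonal.
vectors = {"cat": [1.0, 0.0], "feline": [0.0, 1.0], "car": [1.0, 1.0]}
lexicon = {"cat": ["feline"], "feline": ["cat"]}

retrofitted = retrofit(vectors, lexicon)
print("before:", cosine(vectors["cat"], vectors["feline"]))
print("after: ", cosine(retrofitted["cat"], retrofitted["feline"]))
```

Words that do not appear in the lexicon (here "car") keep their original vectors, so only the words you have constraints for are moved.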
