Adding additional words in word2vec or Glove (maybe using gensim)

Question

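I have two pretrained word embeddings: Glove.840b.300.txt and custom_glove.300.txt

One was pretrained by Stanford and the other I trained myself. The two have different vocabularies. To reduce OOV (out-of-vocabulary) words, I'd like to add the words that don't appear in file1 but do appear in file2 to file1. How can I do that easily?

This is how I load and save the files in gensim 3.4.0.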

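from gensim.models.keyedvectors import KeyedVectors

# Load vectors stored in word2vec text format (binary=False is the default).
model = KeyedVectors.load_word2vec_format('path/to/thefile')
# Write the vectors back out as plain text.
model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)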

Answer 1

I don't know of an easy way.

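In particular, word vectors that weren't trained together won't have compatible/comparable coordinate spaces. (There's no single right place for a word, just a relatively good place compared to the other words in the same model.)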
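To make that concrete, here is a toy sketch using NumPy. It assumes, purely for illustration, that a second training run produced the same geometry as the first but under an arbitrary rotation (real runs differ in messier ways). The words, the dimensionality and the random vectors are all made up.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
dim = 50

# Hypothetical "model A": random vectors standing in for trained embeddings.
model_a = {w: rng.normal(size=dim) for w in ("king", "queen")}

# Hypothetical "model B": the same geometry under an arbitrary rotation
# (an orthogonal matrix obtained from a QR decomposition).
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
model_b = {w: v @ rotation for w, v in model_a.items()}

# Relationships inside each model are identical...
within_a = cosine(model_a["king"], model_a["queen"])
within_b = cosine(model_b["king"], model_b["queen"])

# ...but the same word compared across the two models is not aligned.
across = cosine(model_a["king"], model_b["king"])

print(within_a, within_b, across)
```

Both models are equally "good" internally, yet copying a raw vector from one into the other would drop it at an essentially arbitrary location.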

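So you can't just append the missing words from another model: you'd need to transform them into compatible locations. Fortunately, it seems to work to take a set of shared anchor words that appear in both models, learn a transformation from them, and then apply that transformation to the words you want to move over.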
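A minimal sketch of that idea, assuming a plain least-squares linear map fitted with NumPy (this is not gensim's own implementation). The helper names `learn_linear_map` and `project_missing_words` are hypothetical, and the toy data is built so that the two spaces differ only by a rotation, which real embeddings will not satisfy exactly.

```python
import numpy as np

def learn_linear_map(src_vecs, tgt_vecs, anchor_words):
    """Fit W by least squares so that src_vecs[w] @ W ~= tgt_vecs[w] on the anchors."""
    X = np.array([src_vecs[w] for w in anchor_words])  # (n_anchors, src_dim)
    Y = np.array([tgt_vecs[w] for w in anchor_words])  # (n_anchors, tgt_dim)
    W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
    return W  # (src_dim, tgt_dim)

def project_missing_words(src_vecs, tgt_vecs, W):
    """Map every word found only in src_vecs into the target space."""
    return {w: v @ W for w, v in src_vecs.items() if w not in tgt_vecs}

# Toy data: the "custom" space is the "base" space under a fixed rotation.
rng = np.random.default_rng(0)
dim = 4
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
words = ["w%d" % i for i in range(10)]
base_all = {w: rng.normal(size=dim) for w in words}
custom = {w: v @ rotation.T for w, v in base_all.items()}

# Pretend the base model is missing the last two words.
base = {w: base_all[w] for w in words[:8]}

# Anchors are the words both models share.
anchors = [w for w in custom if w in base]
W = learn_linear_map(custom, base, anchors)
new_vectors = project_missing_words(custom, base, W)

print(sorted(new_vectors))
```

With real embeddings the fit will only be approximate, so it is worth holding out some shared words and checking how close their projected vectors land to their true positions before trusting the projected OOV words.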

There's a class, [TranslationMatrix][1], and a demo notebook in gensim showing this process for language translation (an application mentioned in the original word2vec papers). You could conceivably use this, combined with the ability to append extra vectors to a gensim KeyedVectors instance, to create a new set of vectors containing a superset of the words in either source model.
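The method for appending vectors to a KeyedVectors instance differs between gensim versions, so one version-independent sketch is to merge the vocabularies yourself and write the result in word2vec text format (a header line with the word count and dimensionality, then one word per line followed by its values). The helpers `merge_vocabularies` and `write_word2vec_text` and the tiny example vectors are hypothetical, and the `projected` vectors are assumed to have already been transformed into the base model's space.

```python
import io

def merge_vocabularies(base, projected):
    """Return base plus any projected words that base does not already have."""
    merged = dict(base)
    for word, vec in projected.items():
        merged.setdefault(word, vec)  # never overwrite an existing base vector
    return merged

def write_word2vec_text(vectors, out):
    """Write {word: sequence of floats} to a text stream in word2vec text format."""
    dims = {len(v) for v in vectors.values()}
    if len(dims) != 1:
        raise ValueError("all vectors must share one dimensionality")
    out.write("%d %d\n" % (len(vectors), dims.pop()))  # header: count and size
    for word, vec in vectors.items():
        out.write(word + " " + " ".join("%.6f" % x for x in vec) + "\n")

# Made-up 2-dimensional vectors, for illustration only.
base = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}
projected = {"dog": [9.0, 9.0], "axolotl": [0.5, 0.6]}

merged = merge_vocabularies(base, projected)
buffer = io.StringIO()  # stands in for open(path, "w", encoding="utf-8")
write_word2vec_text(merged, buffer)
text = buffer.getvalue()
print(text)
```

A file written this way should load with `KeyedVectors.load_word2vec_format(path, binary=False)`, the same call used in the question.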

nlp

gensim

word2vec

glove