Continue training a Doc2Vec model

Question

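Gensim's official tutorial explicitly states that it is possible to continue training a (loaded) model. I'm aware that, according to the documentation, it is not possible to continue training models that were loaded from the word2vec format. But even when I generate a model from scratch and then call the train method, I cannot access the newly created labels for the LabeledSentence instances supplied to train.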

>>> from gensim.models.doc2vec import Doc2Vec, LabeledSentence
>>> sentences = [LabeledSentence(['first', 'sentence'], ['SENT_0']), LabeledSentence(['second', 'sentence'], ['SENT_1'])]
>>> model = Doc2Vec(sentences, min_count=1)
>>> print(model.vocab.keys())
dict_keys(['SENT_0', 'SENT_1', 'sentence', 'first', 'second'])
>>> sentence = LabeledSentence(['third', 'sentence'], ['SENT_2'])
>>> model.train([sentence])
>>> print(model.vocab.keys())

# At this point I would expect the key 'SENT_2' to be present in the vocabulary, but it isn't
dict_keys(['SENT_0', 'SENT_1', 'sentence', 'first', 'second'])

Is it possible to continue training a Doc2Vec model in Gensim with new sentences? If so, how?

Answer 1

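My understanding is that this is not possible for any new labels. We can only continue training when the new data has the same labels as the old data. As a result, we are training or re-tuning the weights of the already learned vocabulary, but we cannot learn any new vocabulary.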

There is a similar question about adding new labels/words/sentences during training: https://groups.google.com/forum/#!searchin/word2vec-toolkit/online$20word2vec/word2vec-toolkit/L9zoczopPUQ/_Zmy57TzxUQJ

Also, you might want to keep an eye on this discussion: https://groups.google.com/forum/#!topic/gensim/UZDkfKwe9VI

Update: If you want to add new words to an already trained model, take a look at online word2vec here: http://rutumulkar.com/blog/2015/word2vec/

Answer 2

According to the gensim documentation, online/incremental training of doc2vec is not supported.

See https://github.com/RaRe-Technologies/gensim/issues/1019

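I was still able to add new documents to an existing doc2vec model (although it sometimes crashes with a segmentation fault), but most-similar queries mostly do not work for the newly added documents, so this approach does not seem useful.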
