Doc2Vec model splits document tags into symbols

Question

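I'm using gensim 3.0.1.

I have a list of TaggedDocument objects with unique tags of the form "label_17", but when I train a Doc2Vec model, it somehow splits the tags into individual symbols, so the model's doctags end up looking like this: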

{'0': Doctag(offset=5, word_count=378, doc_count=40),
 '1': Doctag(offset=6, word_count=1330, doc_count=141),
 '2': Doctag(offset=7, word_count=413, doc_count=50),
 '3': Doctag(offset=8, word_count=365, doc_count=41),
 '4': Doctag(offset=9, word_count=395, doc_count=41),
 '5': Doctag(offset=10, word_count=420, doc_count=41),
 '6': Doctag(offset=11, word_count=408, doc_count=41),
 '7': Doctag(offset=12, word_count=426, doc_count=41),
 '8': Doctag(offset=13, word_count=385, doc_count=41),
 '9': Doctag(offset=14, word_count=376, doc_count=40),
 '_': Doctag(offset=4, word_count=2009, doc_count=209),
 'a': Doctag(offset=1, word_count=2009, doc_count=209),
 'b': Doctag(offset=2, word_count=2009, doc_count=209),
 'e': Doctag(offset=3, word_count=2009, doc_count=209),
 'l': Doctag(offset=0, word_count=4018, doc_count=418)}

But in the initial list of tagged documents, every document has its own unique tag.

The code for training the model is as follows:

from gensim.models import Doc2Vec

model = Doc2Vec(size=300, sample=1e-4, workers=2)
print('Building Vocabulary')
model.build_vocab(data)
print('Training...')
model.train(data, total_words=total_words_count, epochs=20)

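As a result, I can't index my documents with something like model.docvecs['label_17']; I get a KeyError.

The same thing happens if I pass the data to the constructor instead of calling build_vocab separately.

Why is this happening? Thanks.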

Answer 1

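Doc2Vec expects its text examples, objects shaped like TaggedDocument, to have a tags property that is a list of tags.

If you instead supply a string, like 'label_17', it is effectively a list of characters, so you are essentially saying that the TaggedDocument has these tags: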

['l', 'a', 'b', 'e', 'l', '_', '1', '7']

Be sure to make tags a list containing one tag, e.g. tags=['label_17'], and you should see results closer to what you expected in terms of trained tags.
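To make the difference concrete, here is a minimal sketch of why a bare string gets split. It assumes a namedtuple stand-in for TaggedDocument so that it runs without gensim installed (in real code you would import TaggedDocument from gensim.models.doc2vec), and collect_tags is a hypothetical helper that only imitates the idea of scanning every document's tags; it is not gensim's actual implementation.

```python
from collections import Counter, namedtuple

# Stand-in for gensim's TaggedDocument (assumption: used here only so the
# sketch is self-contained; it has the same two fields, words and tags).
TaggedDocument = namedtuple("TaggedDocument", "words tags")


def collect_tags(documents):
    """Hypothetical helper: count tags by iterating over each document's tags."""
    counts = Counter()
    for doc in documents:
        for tag in doc.tags:  # iterating over a str yields single characters
            counts[tag] += 1
    return counts


texts = [["some", "words"], ["more", "words"]]

# Buggy: tags is a bare string, so every character is treated as a tag.
bad = [TaggedDocument(words, "label_%d" % i) for i, words in enumerate(texts)]

# Fixed: tags is a one-element list, so the whole label is a single tag.
good = [TaggedDocument(words, ["label_%d" % i]) for i, words in enumerate(texts)]

print(sorted(collect_tags(bad)))   # single characters such as 'l', 'a', '_'
print(sorted(collect_tags(good)))  # the full labels
```

With the buggy version the character 'l' is counted twice per document, which matches the doubled doc_count for 'l' in the output shown in the question.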

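Separately: it looks like you have about 200 documents of about 10 words each. Note that Word2Vec/Doc2Vec need large, varied datasets to give good results. In particular, with only 200 texts but 300 vector dimensions, training can become very good at the training task (internal word prediction) simply by memorizing the idiosyncrasies of the training set. That is essentially overfitting: the resulting vector distances and arrangement do not represent generalizable knowledge that would transfer to other examples.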

python-3.x

gensim

doc2vec