How to properly tag a list of documents with Gensim TaggedDocument()

Question

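I would like to tag a list of documents with Gensim TaggedDocument(), and then pass these documents as input to Doc2Vec().

I have read the documentation for TaggedDocument, but I don't understand exactly what the parameters words and tags mean.

I have tried: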

texts = [[word for word in document.lower().split()]
          for document in X.values]

texts = [[token for token in text]
          for text in texts]

model = gensim.models.Doc2Vec(texts, vector_size=200)
model.train(texts, total_examples=len(texts), epochs=10)

But I get the error 'list' object has no attribute 'words'.

Answer 1

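Doc2Vec expects an iterable collection of texts, each of which is (shaped like) the example TaggedDocument class, with both a words and a tags property.

The words can be your tokenized text (as a list), but the tags should be a list of document tags that should receive learned vectors via the Doc2Vec algorithm. Most often these are unique IDs, one per document. (You can just use plain int indexes, if that works as a way to refer to your documents elsewhere, or string IDs.) Note that tags must be a list of tags, even if you are only providing one per document.

You have simply provided a list of lists-of-words, which is what produces the error.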
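As a rough illustration of the difference, the sketch below compares a plain token list with a TaggedDocument. The raw_docs data is made up for the example, and the namedtuple fallback is an assumption added only so the snippet still runs where gensim is not installed; it mirrors the two fields (words, tags) that gensim's own class exposes.

```python
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Assumption: a stand-in with the same two fields, used only so this
    # snippet can run without gensim installed.
    TaggedDocument = namedtuple("TaggedDocument", "words tags")

# Hypothetical toy data standing in for X.values from the question.
raw_docs = ["Human machine interface", "Graph minors survey"]

# What the question passed to Doc2Vec: a plain list of token lists.
plain = [doc.lower().split() for doc in raw_docs]
print(hasattr(plain[0], "words"))  # False, hence the AttributeError

# What Doc2Vec expects: objects that expose both .words and .tags.
tagged = [TaggedDocument(words=tokens, tags=[i])
          for i, tokens in enumerate(plain)]
print(tagged[0].words)  # ['human', 'machine', 'interface']
print(tagged[0].tags)   # [0]
```

Each tag is wrapped in a list here (tags=[i]) even though there is only one tag per document, as the answer requires.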

Try instead just a single statement to initialize texts:

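from gensim.models.doc2vec import TaggedDocument

# One TaggedDocument per row: lowercased tokens as words, the row index as the only tag
texts = [TaggedDocument(
             words=document.lower().split(),
             tags=[i]
         ) for i, document in enumerate(X.values)]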

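Additionally, you don't need to call train() if you supplied texts when Doc2Vec was created. (By supplying the corpus at initialization, Doc2Vec will automatically do both an initial vocabulary-discovery scan and then your specified number of training passes.)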

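You should look at working examples for inspiration, such as the doc2vec-lee.ipynb runnable Jupyter notebook included with gensim. It will be in your installation directory, if you can find it, but you can also view a (static, non-runnable) version in the gensim source code repository at: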

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb

nlp

gensim

doc2vec