Building vocabulary using document vector

Question

I am unable to build the vocabulary and get this error:

TypeError: 'int' object is not iterable

Here is my code, which is based on this Medium article:

https://towardsdatascience.com/implementing-multi-class-text-classification-with-doc2vec-df7c3812824d

I have tried passing both a pandas Series and a list to the build_vocab function.

import pandas as pd

from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.model_selection import train_test_split
import multiprocessing
import nltk
from nltk.corpus import stopwords

def tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) < 2:
                continue
            tokens.append(word.lower())
    return tokens

df = pd.read_csv("https://raw.githubusercontent.com/RaRe-Technologies/movie-plots-by-genre/master/data/tagged_plots_movielens.csv")

tags_index = {
    "sci-fi": 1,
    "action": 2,
    "comedy": 3,
    "fantasy": 4,
    "animation": 5,
    "romance": 6,
}

df["tindex"] = df.tag.replace(tags_index)
df = df[["plot", "tindex"]]

mylist = list()
for i, q in df.iterrows():
    mylist.append(
        TaggedDocument(tokenize_text(str(q["plot"])), tags=q["tindex"])
    )

df["tdoc"] = mylist

X = df[["tdoc"]]
y = df["tindex"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

cores = multiprocessing.cpu_count()
model_doc2vec = Doc2Vec(
    dm=1,
    vector_size=300,
    negative=5,
    hs=0,
    min_count=2,
    sample=0,
    workers=cores,
)
model_doc2vec.build_vocab([x for x in X_train["tdoc"]])

The documentation for this method is very confusing.

Answer 1

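Doc2Vec needs its corpus to be an iterable sequence of TaggedDocument-like objects (as fed to either build_vocab() or train()).

When showing an error, you should also show the full stack trace that accompanied it, so that it is clear which lines of code and which surrounding call frames are involved.

However, it is unclear whether what you have put into the DataFrame, then pulled out via DataFrame bracket access, then passed through train_test_split() is actually such a sequence.

So I would suggest assigning things to descriptive interim variables and verifying that they contain the right kinds of things at each step.

Is X_train["tdoc"][0] a proper TaggedDocument, with a words property that is a list of strings and a tags property that is a list of tags? (And each tag is probably a string, but could perhaps be a plain int, counting upward from 0.)

Is mylist[0] a proper TaggedDocument?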
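As a minimal sketch of that kind of check: the helper name describe_problems is hypothetical, and the namedtuple below is only a stand-in with the same two fields (words, tags) as gensim's TaggedDocument, so that the snippet runs without gensim installed. It builds one document the way the question does (tags as a bare number) and one with the tag wrapped in a list.

```python
import numbers
from collections import namedtuple

# ASSUMPTION: stand-in with the same two fields as gensim's TaggedDocument,
# used only so this sketch runs without gensim. With gensim installed, use:
#   from gensim.models.doc2vec import TaggedDocument
TaggedDocument = namedtuple("TaggedDocument", "words tags")


def describe_problems(doc):
    """Hypothetical helper: list what looks wrong; an empty list means OK."""
    problems = []
    words = getattr(doc, "words", None)
    tags = getattr(doc, "tags", None)
    if not isinstance(words, list) or not all(isinstance(w, str) for w in words):
        problems.append("words must be a list of strings")
    if not isinstance(tags, list):
        problems.append(f"tags must be a list, got {type(tags).__name__}")
    elif not all(isinstance(t, (str, numbers.Integral)) for t in tags):
        problems.append("each tag must be a string or an integer")
    return problems


# Built the way the question builds it: tags is a bare int, not a list.
suspect = TaggedDocument(words=["a", "space", "adventure"], tags=1)
# The same document with the tag wrapped in a list.
repaired = TaggedDocument(words=["a", "space", "adventure"], tags=[1])

print(describe_problems(suspect))
print(describe_problems(repaired))
```

Anything that loops over the tags of the first document will fail, because iterating over an int raises "TypeError: 'int' object is not iterable". That makes a bare int in tags a plausible source of the reported error, though only the full stack trace can confirm it. When running the same check on the question's data, note that X_train["tdoc"][0] looks up the index label 0, which may not have landed in the training split; X_train["tdoc"].iloc[0] takes the first row by position.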

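Separately: many online examples of Doc2Vec use contain egregious errors, and the Medium article you link is no exception. Its practice of calling train() multiple times in a loop is usually unnecessary and very error-prone, and in that article it results in severe mismanagement of the learning rate alpha. (For example, subtracting 0.002 from the starting-default alpha of 0.025 thirty times results in a negative effective alpha, which is never justified and means the model gets worse with every example. This may be a factor contributing to the poor classifier accuracy the article reports.)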
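The arithmetic behind that point can be checked in a few lines of plain Python. This sketch only reproduces the subtraction described above, using the figures quoted there (0.025, 0.002, 30); it does not train any model, and the variable names are illustrative.

```python
START_ALPHA = 0.025  # starting-default alpha quoted above
STEP = 0.002         # amount subtracted on each pass
PASSES = 30          # number of subtractions quoted above

alpha = START_ALPHA
first_negative_pass = None  # which subtraction first pushes alpha below zero
for n in range(1, PASSES + 1):
    alpha -= STEP
    if alpha < 0 and first_negative_pass is None:
        first_negative_pass = n

print(f"alpha after {PASSES} subtractions: {alpha:.3f}")
print(f"alpha first drops below zero at subtraction #{first_negative_pass}")
```

Since 0.025 / 0.002 = 12.5, the value is already negative well before the halfway point of the 30 passes, so most of the loop runs with a learning rate below zero.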

I would disregard that article entirely and look elsewhere for better examples.
