Python error: "'numpy.ndarray' object has no attribute 'words'" when training doc2vec

Question

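When training my doc2vec model, I make multiple passes over the dataset and shuffle the training reviews each time to improve accuracy. Python then raises AttributeError: 'numpy.ndarray' object has no attribute 'words'. Here is my Python code: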

def labelizeReviews(reviews, label_type):
  labelized = []
  for index, review in enumerate(reviews):
      label = ' %s_%s ' % (label_type, index)
      labelized.append(LabeledSentence(review, [label]))
  return labelized

x_train = labelizeReviews(x_train, 'TRAIN')  # input x_train is a list of word lists, each word list is a list of tokens of all words in one document
x_train=np.array(x_train)
model_dm = gensim.models.Doc2Vec(alpha=0.025, min_alpha=0.0001, iter=10, min_count=5, window=10, size=size, sample=1e-3,
                                 negative=5, workers=3)
for epoch in range(10):
    perm = np.random.permutation(x_train.shape[0])
    model_dm.train(x_train[perm], total_examples=model_dbow.corpus_count, epochs=model_dbow.iter)

And here is the error message:

Exception in thread Thread-4:
Traceback (most recent call last):
  File "C:\Users3\Anaconda2\lib\threading.py", line 801, in __bootstrap_inner
    self.run()
  File "C:\Users3\Anaconda2\lib\threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "C:\Users3\Anaconda2\lib\site-packages\gensim-2.1.0-py2.7-win-amd64.egg\gensim\models\word2vec.py", line 857, in job_producer
    sentence_length = self._raw_word_count([sentence])
  File "C:\Users3\Anaconda2\lib\site-packages\gensim-2.1.0-py2.7-win-amd64.egg\gensim\models\doc2vec.py", line 729, in _raw_word_count
    return sum(len(sentence.words) for sentence in job)
  File "C:\Users3\Anaconda2\lib\site-packages\gensim-2.1.0-py2.7-win-amd64.egg\gensim\models\doc2vec.py", line 729, in <genexpr>
    return sum(len(sentence.words) for sentence in job)
AttributeError: 'numpy.ndarray' object has no attribute 'words'

Does anyone know how to fix this? Thanks a lot!

Answer 1

Pick a good demo or tutorial as your guide: first run it as-is to see things working correctly, then adapt it to your own data or parameters.

One example is the Doc2Vec intro Jupyter notebook bundled with gensim, doc2vec-lee.ipynb. You can find it in the docs/notebooks subdirectory of your installed gensim directory, or view it online:

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb

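Note that the demo runs on an unrealistically small toy dataset: only 300 short documents of a few hundred words each. Doc2Vec usually doesn't give good results on a dataset that small, but the demo uses an atypically small size (50 dimensions) and an atypically large iter (55) to retain some usefulness.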

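(For a more typical training set of tens of thousands to millions of documents, you would use a more typical size of 100 or more dimensions, and a more typical iter of just 10-20.)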

But if you build on a good, working example like this, you avoid certain mistakes. For example:

You would use the currently recommended example class, TaggedDocument, rather than its older variant, LabeledSentence.
You wouldn't turn your corpus into a numpy ndarray, which is a completely unnecessary step and the proximate cause of the error you are seeing.
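You wouldn't call train() multiple times in your own loop, which is error-prone and almost always the wrong thing to do unless you are an expert user carefully managing all the parameters. (You are running 10 loops, and each train() call makes 10 passes over the data while the class manages the learning rate alpha from 0.025 down to 0.0001. So alpha jumps back up and decays again on every loop, which is almost certainly not what you want.)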
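You wouldn't give every document the same single tag 'TRAIN', which means Doc2Vec can't possibly do anything useful. The algorithm needs a variety of documents with varied tags to learn contrasting vectors for the different documents/tags.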