ValueError: cannot reshape array of size 3800 into shape (1,200)

Question

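I am trying to apply word embeddings to tweets. I want to create a vector for each tweet by averaging the vectors of the words that appear in it, as follows: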

def word_vector(tokens, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += model_w2v[word].reshape((1, size))
            count += 1.
        except KeyError:  # handling the case where the token is not in vocabulary
            continue
    if count != 0:
        vec /= count
    return vec

Next, when I try to prepare the word2vec feature set as follows:

wordvec_arrays = np.zeros((len(tokenized_tweet), 200))
#the length of the vector is 200

for i in range(len(tokenized_tweet)):
    wordvec_arrays[i,:] = word_vector(tokenized_tweet[i], 200)

wordvec_df = pd.DataFrame(wordvec_arrays)
wordvec_df.shape

I get the following error in the loop:

ValueError                                Traceback (most recent call last)
<ipython-input-32-72aee891e885> in <module>
      4 # wordvec_arrays.reshape(1,200)
      5 for i in range(len(tokenized_tweet)):
----> 6     wordvec_arrays[i,:] = word_vector(tokenized_tweet[i], 200)
      7 
      8 wordvec_df = pd.DataFrame(wordvec_arrays)

<ipython-input-31-9e6501810162> in word_vector(tokens, size)
      4     for word in tokens:
      5         try:
----> 6             vec += model_w2v.wv.__getitem__(word).reshape((1, size))
      7             count += 1.
      8         except KeyError: # handling the case where the token is not in vocabulary

ValueError: cannot reshape array of size 3800 into shape (1,200)

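I checked all the available posts on Stack Overflow, but none of them really helped me.

I tried reshaping the array, but it still gives me the same error.

My model is: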

tokenized_tweet = df['tweet'].apply(lambda x: x.split()) # tokenizing

model_w2v = gensim.models.Word2Vec(
            tokenized_tweet,
            size=200, # desired no. of features/independent variables 
            window=5, # context window size
            min_count=2,
            sg = 1, # 1 for skip-gram model
            hs = 0,
            negative = 10, # for negative sampling
            workers= 2, # no.of cores
            seed = 34)

model_w2v.train(tokenized_tweet, total_examples= len(df['tweet']), epochs=20)

Any suggestions?

Answer 1

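It looks like the purpose of your word_vector() method is to take a list of words and then, with respect to a given Word2Vec model, return the average of all those words' vectors (when present).

To do that, you shouldn't need any explicit re-shaping of vectors, or even to specify size, because that is already determined by what the model provides. You could use utility methods from numpy to simplify the code a lot. For example, the gensim n_similarity() method, as part of its comparison of two lists of words, already does averaging much like what you are attempting, and you could look at its source as a model: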

https://github.com/RaRe-Technologies/gensim/blob/f97d0e793faa57877a2bbedc15c287835463eaa9/gensim/models/keyedvectors.py#L996
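
As background on the error message itself: 3800 is 19 × 200, so the array being reshaped held 19 vectors' worth of values rather than one. A plausible cause, which the question does not confirm, is that `word` was a list of tokens rather than a single string, since indexing gensim's keyed vectors with a list returns one row per word. The sketch below only reproduces the shape arithmetic with plain NumPy arrays; `lookup_result` is a hypothetical stand-in for whatever the model lookup returned.

```python
import numpy as np

size = 200

# Hypothetical stand-in for a lookup that returned 19 vectors at once
# (assumption: the question does not show what `word` actually was).
lookup_result = np.zeros((19, size))
print(lookup_result.size)  # 19 * 200 elements in total

try:
    lookup_result.reshape((1, size))
    reshape_error = None
except ValueError as exc:
    # NumPy refuses because the element counts do not match.
    reshape_error = str(exc)
    print(reshape_error)

# A single word vector has exactly `size` elements, so this reshape succeeds.
single_vector = np.zeros(size)
print(single_vector.reshape((1, size)).shape)
```

Printing `type(word)` just before the failing line would be one way to check whether this is what happened.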

So, although I have not tested this code, I think your word_vector() method could essentially be replaced with:

import numpy as np

def average_words_vectors(tokens, wv_model):
    vectors = [wv_model[word] for word in tokens 
               if word in wv_model]  # avoiding KeyError
    return np.array(vectors).mean(axis=0)
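
One gap in the untested snippet above: if none of a tweet's tokens are in the vocabulary, the list is empty and the mean is not a usable vector. The following is a minimal sketch of a variant that falls back to a zero vector, which is what the original word_vector() returns in that case. The names `average_or_zeros` and `toy_vectors` and the `size` parameter are illustrative assumptions, and a plain dict stands in for the trained model's keyed vectors so the example runs without gensim.

```python
import numpy as np

def average_or_zeros(tokens, vectors, size):
    """Average the vectors of known tokens; return zeros if none are known."""
    found = [vectors[word] for word in tokens if word in vectors]
    if not found:
        return np.zeros(size)
    return np.mean(found, axis=0)

# Toy 3-dimensional "model" (assumption: stands in for the trained word vectors).
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "day": np.array([3.0, 2.0, 0.0]),
}
tokenized = [["good", "day"], ["unknown"], ["good", "unseen"]]

# One row per tweet; each returned vector is 1-D, so no reshape is needed.
features = np.vstack([average_or_zeros(t, toy_vectors, 3) for t in tokenized])
print(features.shape)
```

Because each call returns a one-dimensional vector, the result can also be assigned directly to a row of a preallocated array, as in the question's loop.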

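(At times, it can make sense to work with vectors that have been normalized to unit length, as the linked gensim code does by applying gensim.matutils.unitvec() to the average. I haven't done that here, since your method did not take that step, but it is something to consider.)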
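
The unit-length normalization mentioned above amounts to dividing the averaged vector by its Euclidean norm. A minimal NumPy sketch follows; `to_unit_length` is a hypothetical helper written for illustration, not gensim's implementation, and it leaves an all-zero vector unchanged to avoid dividing by zero.

```python
import numpy as np

def to_unit_length(vec):
    """Scale vec to Euclidean length 1; return it unchanged if it is all zeros."""
    norm = np.linalg.norm(vec)
    if norm == 0:
        return vec
    return vec / norm

mean_vec = np.array([3.0, 4.0])   # example averaged vector with norm 5
unit_vec = to_unit_length(mean_vec)
print(unit_vec, np.linalg.norm(unit_vec))
```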

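Separate observations about your Word2Vec training code:

Words that appear only once, twice, or a few times usually don't get good vectors (because of the limited number and variety of examples), yet they do interfere with the improvement of other, more common words' vectors. That is why the default is min_count=5. So be aware: your surviving vectors may get better if you use the default (or an even larger) value here and discard more of the rare words.

The dimensions of a "dense embedding" like word2vec vectors are not really "independent variables" (or standalone, individually interpretable "features") as your code comment implies, even though they look like separate values/slots in the data. For example, you can't pick one dimension and conclude, "that's the foo-ness of this sample" (such as 'coldness' or 'hardness' or 'positiveness'). Rather, any human-describable meaning tends to lie along other directions in the combined space, not perfectly aligned with any individual dimension. You can sort of tease those out by comparing vectors, and downstream ML algorithms can make use of those complicated, entangled multi-dimensional interactions. But if you think of each dimension as its own "feature" in any sense beyond the fact that, yes, it is technically a single number associated with the item, you may be prone to misinterpreting the vector space.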

python

tokenize

word2vec

deep-learning

word-embedding