Numpy

Question

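I have a problem with numpy arrays. I use CountVectorizer from sklearn with a word set and values (from a pandas column) to build an array of arrays of word counts (BoW). When I print the array and its shape, I get this result: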

[[array([0, 5, 0, ..., 0, 0, 0])]
 [array([0, 0, 0, ..., 0, 0, 0])]
 [array([0, 0, 0, ..., 0, 0, 0])]
 ...
 [array([0, 0, 0, ..., 0, 0, 0])]
 [array([0, 0, 0, ..., 0, 0, 0])]
 [array([0, 0, 0, ..., 0, 0, 0])]] (2800, 1)

An array of arrays with the shape of a vector???

I checked that all rows have the same size.

Here is a way to reproduce my problem:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


data = pd.DataFrame(["blop blip blup", "bop bip bup", "boop boip boup"], columns=["corpus"])

# add labels column
data["label"] = ["blop", "bip", "boup"]

wordset = pd.Series([y for x in data["corpus"].str.split() for y in x]).unique()
cvec = CountVectorizer(vocabulary=wordset, ngram_range=(1, 2))
    
labels_count_np = data["label"].apply(lambda x: cvec.fit_transform([x]).toarray()[0]).values

print(labels_count_np, labels_count_np.shape)

This should return:

[array([1, 0, 0, 0, 0, 0, 0, 0, 0]) array([0, 0, 0, 0, 1, 0, 0, 0, 0])
 array([0, 0, 0, 0, 0, 0, 0, 0, 1])] (3,)

Can someone explain why numpy behaves this way?

Also, I am trying to find a way to concatenate multiple arrays like this:

A = [array([1, 0, 0, 0, 0, 0, 0, 0, 0]) array([0, 0, 0, 0, 1, 0, 0, 0, 0])
 array([0, 0, 0, 0, 0, 0, 0, 0, 1])]
B = [array([0, 7, 2, 0]) array([1, 4, 0, 8])
 array([6, 1, 0, 9])]

concatenate(A,B) =>
[
  [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 2, 0],
  [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 4, 0, 8],
  [0, 0, 0, 0, 0, 0, 0, 0, 1, 6, 1, 0, 9]
]

But I have not found a good way to do it.

Answer 1

You can concatenate with a list comprehension:

C = [np.append(x, B[i]) for i, x in enumerate(A)]

Output

[array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 2, 0]), 
 array([0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 4, 0, 8]), 
 array([0, 0, 0, 0, 0, 0, 0, 0, 1, 6, 1, 0, 9])]
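If a single 2d numeric array is wanted, as in the question's expected output, one possible sketch is to stack each object array into a 2d array and then join the two column-wise with np.hstack. A and B below are rebuilt by hand from the question's example, and as_object_array is a hypothetical helper used only to recreate the object-dtype arrays. This assumes every row within A (and within B) has the same length.

```python
import numpy as np


def as_object_array(rows):
    """Hypothetical helper: build a 1d object array holding one 1d array per row."""
    out = np.empty(len(rows), dtype=object)
    for i, row in enumerate(rows):
        out[i] = np.asarray(row)
    return out


# Rebuilt from the question's example.
A = as_object_array([
    [1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1],
])
B = as_object_array([
    [0, 7, 2, 0],
    [1, 4, 0, 8],
    [6, 1, 0, 9],
])

# np.stack turns each object array into a regular 2d array;
# np.hstack then joins them along the column axis.
C = np.hstack([np.stack(A), np.stack(B)])
print(C.shape)
```

Unlike the list comprehension, this gives one regular (3, 13) array rather than a list of 1d arrays.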

Answer 2

The values from a DataFrame, even one with only a single column, will be 2d. The values from a Series, that is, one column of the frame, will be 1d.

If labels_count_np has shape (2800, 1), you can easily make it 1d with labels_count_np[:,0] or np.squeeze(labels...). That is just basic numpy.
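A minimal sketch of both options, using a small hand-built (3, 1) object array as a stand-in for labels_count_np (the name col and the sizes are assumptions for illustration):

```python
import numpy as np

# Stand-in for labels_count_np: a (3, 1) object array whose
# single column holds one 1d integer array per row.
col = np.empty((3, 1), dtype=object)
for i in range(3):
    col[i, 0] = np.zeros(4, dtype=int)

# Either select the only column, or squeeze out the length-1 axis.
flat_by_index = col[:, 0]
flat_by_squeeze = np.squeeze(col, axis=1)

print(col.shape, flat_by_index.shape, flat_by_squeeze.shape)
# (3, 1) (3,) (3,)
```

Passing axis=1 to np.squeeze is a deliberate choice here: it removes only that axis and raises an error if its length is not 1.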

It will still be an object dtype array containing arrays, the elements of the dataframe cells. If those arrays are all the same size, then

 np.stack(labels_count_np[:,0])

should create a 2d numeric array.
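The "same size" condition matters because np.stack raises a ValueError when the inner arrays differ in shape. A rough sketch of a guarded version follows; stack_rows and make_object_array are hypothetical names, not part of numpy.

```python
import numpy as np


def make_object_array(rows):
    """Hypothetical helper: 1d object array holding one 1d array per row."""
    out = np.empty(len(rows), dtype=object)
    for i, row in enumerate(rows):
        out[i] = np.asarray(row)
    return out


def stack_rows(obj_col):
    """Stack a 1d object array of equal-length 1d arrays into a 2d array."""
    lengths = {len(row) for row in obj_col}
    if len(lengths) != 1:
        raise ValueError(f"rows have differing lengths: {sorted(lengths)}")
    return np.stack(obj_col)


same_size = make_object_array([[0, 5, 0, 0], [0, 0, 0, 0], [1, 0, 0, 2]])
ragged = make_object_array([[0, 5, 0, 0], [0, 0], [1, 0, 0, 2]])

stacked = stack_rows(same_size)
print(stacked.shape, stacked.dtype != object)

try:
    stack_rows(ragged)
    ragged_failed = False
except ValueError:
    # Rows of different lengths cannot form a rectangular 2d array.
    ragged_failed = True
```

The explicit length check only gives a clearer error message; np.stack on its own would also refuse the ragged input.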

Make a frame with array elements:

In [35]: df = pd.DataFrame([None,None,None], columns=['x'])
In [36]: df
Out[36]: 
      x
0  None
1  None
2  None
In [37]: for i in range(3):df['x'][i] = np.zeros(4,int)
In [38]: df
Out[38]: 
              x
0  [0, 0, 0, 0]
1  [0, 0, 0, 0]
2  [0, 0, 0, 0]

The 2d array from the frame:

In [39]: df.values
Out[39]: 
array([[array([0, 0, 0, 0])],
       [array([0, 0, 0, 0])],
       [array([0, 0, 0, 0])]], dtype=object)
In [40]: _.shape
Out[40]: (3, 1)

From the Series:

In [41]: df['x'].values
Out[41]: 
array([array([0, 0, 0, 0]), array([0, 0, 0, 0]), array([0, 0, 0, 0])],
      dtype=object)
In [42]: _.shape
Out[42]: (3,)

Joining the Series values into one 2d array:

In [43]: np.stack(df['x'].values)
Out[43]: 
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])

Numpy - array of arrays recognize as vector

python

pandas

countvectorizer

numpy-ndarray