np.asarray() gives me one column array where data was multi column

Question

print(X_train_bow.shape) #Output: (897, 2794)
print(type(X_train_bow)) #Output: <class 'scipy.sparse.csr.csr_matrix'>

x_train_groups = [X_train_bow[i::3] for i in range(3)]

print(x_train_groups[0].shape) #Output: (299, 2794)
print(type(X_train_bow[0])) #Output: <class 'scipy.sparse.csr.csr_matrix'>

K = 2
train_data = []
test_data = []

for j in range(0, 3):
    if(j != K):
        train_data.extend(x_train_groups[j])
test_data.extend(x_train_groups[K])

print(np.asarray(train_data).shape) #Output: (598,)
print(np.asarray(test_data).shape) #Output: (299,)

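I am trying to do k-fold cross-validation, so I wrote a way to assemble the training and test data. The problem is that when I call np.asarray, it returns an array whose shape differs from that of the original data. You can see the code above; I have also printed the outputs to help.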

Answer 1

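You are calling .extend() with a 2-D matrix, which adds its rows to the list one at a time. I suspect each element of train_data has 2794 "columns", and likewise for test_data.

Stack the groups directly instead of extending a list. Because the groups are sparse matrices, use scipy.sparse.vstack rather than np.vstack, and keep the result it returns.

Something like:

from scipy import sparse

K = 1
train_data = None

for j in range(0, 3):
    if j != K:
        if train_data is None:
            train_data = x_train_groups[j]
        else:
            # sparse.vstack returns a new matrix, so assign the result
            train_data = sparse.vstack((train_data, x_train_groups[j]))
test_data = x_train_groups[K]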
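
The same idea can be wrapped in a small helper. This is only a sketch: the name `split_fold` and the 9x4 demo matrix are hypothetical stand-ins, and it assumes the folds are taken with strided slicing as in the question.

```python
import numpy as np
from scipy import sparse


def split_fold(X, n_folds, k):
    """Return (train, test) for fold k, using strided row groups X[i::n_folds]."""
    groups = [X[i::n_folds] for i in range(n_folds)]
    # Stack every group except the held-out one; the result stays sparse.
    train = sparse.vstack([g for j, g in enumerate(groups) if j != k], format="csr")
    test = groups[k]
    return train, test


# Hypothetical stand-in for X_train_bow: 9 rows, 4 columns.
X = sparse.csr_matrix(np.arange(36).reshape(9, 4))
train, test = split_fold(X, n_folds=3, k=2)
print(train.shape, test.shape)  # (6, 4) (3, 4)
```

Both results keep two dimensions, so no np.asarray call is needed afterwards.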

Answer 2

If you are trying to implement sklearn's train_test_split() yourself, what you can do with your current code is:

# Slice the sparse matrix directly; np.array() on a list of sparse
# matrices would only produce an object array.
train_data = X_train_bow[299:]
# shape: (598, 2794) by selecting row 299 onwards
test_data = X_train_bow[:299]
# shape: (299, 2794) by selecting first 299 rows
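
If the strided folds from the question are wanted rather than one contiguous split, row index arrays are one possible way to get them. The sketch below uses a hypothetical 9x4 matrix in place of X_train_bow; the variable names are illustrative only.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for X_train_bow: 9 rows, 4 columns.
X = sparse.csr_matrix(np.arange(36).reshape(9, 4))

n_folds, k = 3, 2
fold_of_row = np.arange(X.shape[0]) % n_folds   # fold id of each row
test_idx = np.flatnonzero(fold_of_row == k)     # rows 2, 5, 8
train_idx = np.flatnonzero(fold_of_row != k)    # all other rows

# Integer-array row indexing keeps the result a 2-D sparse matrix.
test_data = X[test_idx]
train_data = X[train_idx]
print(train_data.shape, test_data.shape)  # (6, 4) (3, 4)
```

The same index arrays could also be used to select the matching labels, assuming they are stored in an array with one entry per row.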

Answer 3

Let's make a small demo csr matrix:

In [212]: M = (sparse.random(12,3,.5, 'csr')*10).astype(int)                    
In [213]: M                                                                     
Out[213]: 
<12x3 sparse matrix of type '<class 'numpy.int64'>'
    with 18 stored elements in Compressed Sparse Row format>
In [214]: M.A                                                                   
Out[214]: 
array([[3, 1, 3],
       [0, 0, 1],
       [1, 0, 9],
       [0, 6, 0],
       [5, 4, 0],
       [4, 5, 6],
       [3, 0, 0],
       [0, 0, 5],
       [0, 0, 2],
       [0, 1, 0],
       [0, 0, 0],
       [0, 9, 0]])

Your grouping produces a list of small csr matrices:

In [216]: alist = [M[i::3] for i in range(3)]                                   
In [217]: alist                                                                 
Out[217]: 
[<4x3 sparse matrix of type '<class 'numpy.int64'>'
    with 7 stored elements in Compressed Sparse Row format>,
 <4x3 sparse matrix of type '<class 'numpy.int64'>'
    with 4 stored elements in Compressed Sparse Row format>,
 <4x3 sparse matrix of type '<class 'numpy.int64'>'
    with 7 stored elements in Compressed Sparse Row format>]

Looking at the K case:

In [218]: data = []                                                             
In [219]: data.extend(alist[2])                                                 
In [220]: data                                                                  
Out[220]: 
[<1x3 sparse matrix of type '<class 'numpy.int64'>'
    with 2 stored elements in Compressed Sparse Row format>,
 <1x3 sparse matrix of type '<class 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>,
 <1x3 sparse matrix of type '<class 'numpy.int64'>'
    with 1 stored elements in Compressed Sparse Row format>,
 <1x3 sparse matrix of type '<class 'numpy.int64'>'
    with 1 stored elements in Compressed Sparse Row format>]

List extend adds the elements of an iterable to the list (in a 'flat' sense). Iterating over a sparse matrix (alist[2]) yields a bunch of 1-row sparse matrices (each still 2d).
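
A short sketch of the difference between extend and append on a sparse matrix. The 4x3 matrix is a hypothetical example, and csr_matrix is used explicitly to match the type in the question.

```python
import numpy as np
from scipy import sparse

# Hypothetical 4x3 matrix built from a small dense array.
M = sparse.csr_matrix(np.array([[1, 0, 9], [4, 5, 6], [0, 0, 2], [0, 9, 0]]))

extended = []
extended.extend(M)    # iterates M: one 1x3 sparse matrix per row
appended = []
appended.append(M)    # keeps M as a single 4x3 element

print(len(extended), extended[0].shape)  # 4 (1, 3)
print(len(appended), appended[0].shape)  # 1 (4, 3)
```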

We can join them with sparse.vstack:

In [221]: sparse.vstack(data)                                                   
Out[221]: 
<4x3 sparse matrix of type '<class 'numpy.int64'>'
    with 7 stored elements in Compressed Sparse Row format>
In [222]: sparse.vstack(data).A                                                 
Out[222]: 
array([[1, 0, 9],
       [4, 5, 6],
       [0, 0, 2],
       [0, 9, 0]])

This is the same as the submatrix the rows came from.

In [223]: alist[2]                                                              
Out[223]: 
<4x3 sparse matrix of type '<class 'numpy.int64'>'
    with 7 stored elements in Compressed Sparse Row format>
In [224]: alist[2].A                                                            
Out[224]: 
array([[1, 0, 9],
       [4, 5, 6],
       [0, 0, 2],
       [0, 9, 0]])
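
The round trip can also be checked programmatically. This is a sketch on a hypothetical 4x3 matrix; the conversion to dense arrays is only reasonable because the demo is tiny.

```python
import numpy as np
from scipy import sparse

M = sparse.csr_matrix(np.array([[1, 0, 9], [4, 5, 6], [0, 0, 2], [0, 9, 0]]))
rows = list(M)                               # 1x3 sparse rows, as extend() produces
rebuilt = sparse.vstack(rows, format="csr")  # stack them back into one matrix

print(rebuilt.shape)                                    # (4, 3)
print(np.array_equal(rebuilt.toarray(), M.toarray()))   # True
```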

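Putting the data list into an array just produces a 1d object-dtype array of 1-row sparse matrices. To np.array, these matrices are just foreign objects. As a general rule, don't expect numpy functions to do the 'right' thing with sparse matrices.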

In [225]: np.array(data)                                                        
Out[225]: 
array([<1x3 sparse matrix of type '<class 'numpy.int64'>'
    with 2 stored elements in Compressed Sparse Row format>,
       <1x3 sparse matrix of type '<class 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>,
       <1x3 sparse matrix of type '<class 'numpy.int64'>'
    with 1 stored elements in Compressed Sparse Row format>,
       <1x3 sparse matrix of type '<class 'numpy.int64'>'
    with 1 stored elements in Compressed Sparse Row format>], dtype=object)

Don't just look at the shape. Check the dtype, and inspect some of the elements!
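
A minimal sketch of that kind of inspection, again on a hypothetical 4x3 matrix. It contrasts what np.asarray returns for a list of sparse rows with what sparse.vstack returns.

```python
import numpy as np
from scipy import sparse

M = sparse.csr_matrix(np.array([[1, 0, 9], [4, 5, 6], [0, 0, 2], [0, 9, 0]]))
data = list(M)                     # list of 1x3 sparse rows

arr = np.asarray(data)
print(arr.shape, arr.dtype)        # (4,) object
print(sparse.issparse(arr[0]))     # True: the elements are still sparse matrices

stacked = sparse.vstack(data)      # a proper 2-D sparse result
print(stacked.shape)               # (4, 3)
```

An object dtype together with a 1d shape is the sign that numpy stored the matrices as opaque elements rather than combining their contents.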
