How to properly use numpy hstack

I have a list of documents. I use TfidfVectorizer to get dt_matrix, which is a sparse matrix of type <class 'scipy.sparse.csr.csr_matrix'>:

comments = get_comments()
tfidf_vector = TfidfVectorizer(tokenizer=tokenizer, lowercase=False)
dt_matrix = tfidf_vector.fit_transform(comments)

dt_matrix looks like this:

  (0, 642)  0.14738966496831196
  (0, 1577) 0.20377626427753473
  (0, 1166) 0.2947793299366239
  : :
 (1046, 166)    0.500700591796996

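Now I want to add the length of each document to this matrix as a feature. So I have a length array, where the i-th entry is the length of the i-th document.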

length=get_comments_length()

length is a numpy array like this:

[141  56  79 ...  26  26  26]

I tried to use hstack:

features = np.hstack((dt_matrix, length))

I get this error:

ValueError: Found input variables with inconsistent numbers of samples: [1048, 1047]

I printed the shapes:

print(np.shape(length))
print(np.shape(dt_matrix))

The output is:

(1047,)
(1047, 2078)

What am I doing wrong?

Edit:

sparse.hstack((dt_matrix, length.reshape((length.shape[0], 1)))) is the working code. It uses sparse from scipy. Thanks to @hpaulij and @kederrak for the help.

You can use:

np.hstack((dt_matrix, length.reshape((1047, 1))))

Or:

np.hstack((dt_matrix, length.reshape((length.shape[0], 1))))

From the docs:

Parameters: tup : sequence of ndarrays

The arrays must have the same shape along all but the second axis
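
To illustrate the shape rule quoted above, here is a minimal sketch with dense ndarrays. The arrays `dense` and `length` are made-up stand-ins, not the data from the question, and the sketch only covers plain ndarrays, not scipy sparse matrices.

```python
import numpy as np

# Hypothetical stand-ins: a dense (3, 2) feature matrix and a 1-D length array.
dense = np.array([[0.1, 0.0],
                  [0.0, 0.2],
                  [0.3, 0.0]])
length = np.array([141, 56, 79])

# np.hstack needs both inputs to be 2-D here, so turn the (3,) array
# into a (3, 1) column before stacking.
column = length.reshape((length.shape[0], 1))
features = np.hstack((dense, column))

print(features.shape)  # (3, 3)
```

The reshape keeps the number of rows fixed and only adds a second axis of size 1, which is what lets the two arrays line up row by row.
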
In [123]: from scipy import sparse  

Make a scipy.sparse matrix:

In [124]: M = sparse.random(5,4,.2)                                                            
In [125]: M                                                                                    
Out[125]: 
<5x4 sparse matrix of type '<class 'numpy.float64'>'
    with 4 stored elements in COOrdinate format>
In [126]: print(M)                                                                             
  (0, 3)    0.006222105671732758
  (1, 0)    0.7198559134274957
  (2, 0)    0.3603986399431639
  (4, 2)    0.9519927602284366
In [127]: M.A                                                                                  
Out[127]: 
array([[0.        , 0.        , 0.        , 0.00622211],
       [0.71985591, 0.        , 0.        , 0.        ],
       [0.36039864, 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.95199276, 0.        ]])
In [128]: type(M)                                                                              
Out[128]: scipy.sparse.coo.coo_matrix

Trying np.hstack:

In [129]: np.hstack([M, np.arange(5)[:,None]])                                                 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-129-f06fc972039d> in <module>
----> 1 np.hstack([M, np.arange(5)[:,None]])

<__array_function__ internals> in hstack(*args, **kwargs)

/usr/local/lib/python3.6/dist-packages/numpy/core/shape_base.py in hstack(tup)
    341     # As a special case, dimension 0 of 1-dimensional arrays is "horizontal"
    342     if arrs and arrs[0].ndim == 1:
--> 343         return _nx.concatenate(arrs, 0)
    344     else:
    345         return _nx.concatenate(arrs, 1)

<__array_function__ internals> in concatenate(*args, **kwargs)

ValueError: all the input arrays must have same number of dimensions, 
but the array at index 0 has 1 dimension(s) and the array at index 1
has 2 dimension(s)
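
One plausible reading of this traceback, sketched below: np.hstack passes its inputs through np.atleast_1d, and NumPy does not unpack a scipy sparse matrix into rows and columns, so it appears to wrap the whole matrix in a 1-element object array. The small matrix here is an arbitrary stand-in, and the explanation is an interpretation of the error message rather than something stated in the thread.

```python
import numpy as np
from scipy import sparse

# Arbitrary stand-in for the sparse matrix M used above.
M = sparse.coo_matrix(np.array([[0.0, 0.5],
                                [0.7, 0.0],
                                [0.0, 0.0]]))

# NumPy treats the sparse matrix as one opaque Python object.
wrapped = np.atleast_1d(M)
print(wrapped.shape, wrapped.dtype)

# np.hstack therefore sees a 1-D array next to a 2-D one and refuses to concatenate.
try:
    np.hstack([M, np.arange(3)[:, None]])
    failed = False
except ValueError as exc:
    failed = True
    print("ValueError:", exc)
```
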

Using sparse.hstack correctly:

In [130]: sparse.hstack([M, np.arange(5)[:,None]])                                             
Out[130]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 8 stored elements in COOrdinate format>
In [131]: _.A                                                                                  
Out[131]: 
array([[0.        , 0.        , 0.        , 0.00622211, 0.        ],
       [0.71985591, 0.        , 0.        , 0.        , 1.        ],
       [0.36039864, 0.        , 0.        , 0.        , 2.        ],
       [0.        , 0.        , 0.        , 0.        , 3.        ],
       [0.        , 0.        , 0.95199276, 0.        , 4.        ]])

If the second array has shape (5,) instead of (5,1), I get your latest error:

In [132]: sparse.hstack([M, np.arange(5)])                                                     
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-132-defd4158f59e> in <module>
----> 1 sparse.hstack([M, np.arange(5)])

/usr/local/lib/python3.6/dist-packages/scipy/sparse/construct.py in hstack(blocks, format, dtype)
    463 
    464     """
--> 465     return bmat([blocks], format=format, dtype=dtype)
    466 
    467 

/usr/local/lib/python3.6/dist-packages/scipy/sparse/construct.py in bmat(blocks, format, dtype)
    584                                                     exp=brow_lengths[i],
    585                                                     got=A.shape[0]))
--> 586                     raise ValueError(msg)
    587 
    588                 if bcol_lengths[j] == 0:

ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 1, expected 5.
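
Putting the pieces together, here is a minimal sketch of the fix for the (n,) case: reshape the 1-D array to a column before calling sparse.hstack. The matrix and lengths below are made-up stand-ins for dt_matrix and length, and format="csr" is an optional choice to get a CSR result back.

```python
import numpy as np
from scipy import sparse

# Made-up stand-ins for dt_matrix (3 documents, 4 terms) and length.
dt_matrix = sparse.csr_matrix(np.array([[0.0, 0.1, 0.0, 0.2],
                                        [0.3, 0.0, 0.0, 0.0],
                                        [0.0, 0.0, 0.5, 0.0]]))
length = np.array([141, 56, 79])

# Reshape (3,) -> (3, 1) so the row counts of the two blocks agree.
length_column = length.reshape(-1, 1)

# sparse.hstack accepts a mix of sparse and dense 2-D blocks.
features = sparse.hstack((dt_matrix, length_column), format="csr")

print(features.shape)  # (3, 5)
```

Using reshape(-1, 1) avoids hard-coding the number of documents, in the same spirit as length.reshape((length.shape[0], 1)) in the question's edit.
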