How to properly use numpy hstack
I have a list of documents. I use TfidfVectorizer to get dt_matrix, which is a sparse matrix (<class 'scipy.sparse.csr.csr_matrix'>):
comments = get_comments()
tfidf_vector = TfidfVectorizer(tokenizer=tokenizer, lowercase=False)
dt_matrix = tfidf_vector.fit_transform(comments)
dt_matrix looks like this:
(0, 642) 0.14738966496831196
(0, 1577) 0.20377626427753473
(0, 1166) 0.2947793299366239
: :
(1046, 166) 0.500700591796996
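Now I want to add the length of each document as a feature to this matrix. So I have a length array, where the i-th position holds the length of the i-th document.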
length=get_comments_length()
length is a numpy array that looks like this:
[141 56 79 ... 26 26 26]
I tried to use hstack:
features = np.hstack((dt_matrix, length))
I get this output:
ValueError: Found input variables with inconsistent numbers of samples: [1048, 1047]
I printed the shapes:
print(np.shape(length))
print(np.shape(dt_matrix))
The output is:
(1047,)
(1047, 2078)
What am I doing wrong?
Edit:
sparse.hstack((dt_matrix, length.reshape((length.shape[0], 1))))
This is the working code. It uses sparse from scipy. Thanks to @hpaulij and @kederrak for the help.
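For anyone who wants to try the working line without the original data, here is a minimal sketch. The small matrix and the three lengths are made-up stand-ins for dt_matrix and length, not the real values from the question, and the format="csr" argument is an optional addition that asks SciPy to return the result in CSR form.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for dt_matrix: 3 documents, 4 vocabulary terms.
dt_matrix = sparse.csr_matrix(np.array([
    [0.0, 0.5, 0.0, 0.2],
    [0.7, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.9, 0.1],
]))

# Hypothetical stand-in for length: one entry per document, shape (3,).
length = np.array([141, 56, 79])

# Turn the 1-D array into a single column of shape (3, 1).
length_column = length.reshape((length.shape[0], 1))

# Stack the column to the right of the sparse matrix, keeping it sparse.
features = sparse.hstack((dt_matrix, length_column), format="csr")

print(features.shape)
print(features.toarray())
```

The last column of the result holds the document lengths, and the first four columns are unchanged.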
You can use:
np.hstack((dt_matrix, length.reshape((1047, 1))))
or:
np.hstack((dt_matrix, length.reshape((length.shape[0], 1))))
From the docs:
Parameters: tup : sequence of ndarrays
The arrays must have the same shape along all but the second axis
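To illustrate the shape rule quoted from the docs, here is a small sketch on plain dense ndarrays, which is the input type the docs describe. The array names are hypothetical. It shows three equivalent ways to turn a (5,) array into a (5, 1) column, and that skipping the reshape raises a ValueError. For the sparse dt_matrix, the question's edit uses sparse.hstack instead.

```python
import numpy as np

dense = np.zeros((5, 4))   # hypothetical dense feature matrix, shape (5, 4)
length = np.arange(5)      # 1-D array, shape (5,)

# Three equivalent ways to get a column of shape (5, 1).
col_a = length.reshape((length.shape[0], 1))
col_b = length.reshape(-1, 1)
col_c = length[:, None]

# Both inputs are now 2-D with 5 rows, so they can be joined along axis 1.
stacked = np.hstack((dense, col_a))
print(stacked.shape)

# Without the reshape, a 2-D and a 1-D array cannot be concatenated.
try:
    np.hstack((dense, length))
    failed = False
except ValueError as exc:
    failed = True
    print("ValueError:", exc)
```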
In [123]: from scipy import sparse
Make a scipy.sparse matrix:
In [124]: M = sparse.random(5,4,.2)
In [125]: M
Out[125]:
<5x4 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in COOrdinate format>
In [126]: print(M)
(0, 3) 0.006222105671732758
(1, 0) 0.7198559134274957
(2, 0) 0.3603986399431639
(4, 2) 0.9519927602284366
In [127]: M.A
Out[127]:
array([[0. , 0. , 0. , 0.00622211],
[0.71985591, 0. , 0. , 0. ],
[0.36039864, 0. , 0. , 0. ],
[0. , 0. , 0. , 0. ],
[0. , 0. , 0.95199276, 0. ]])
In [128]: type(M)
Out[128]: scipy.sparse.coo.coo_matrix
Trying to use np.hstack:
In [129]: np.hstack([M, np.arange(5)[:,None]])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-129-f06fc972039d> in <module>
----> 1 np.hstack([M, np.arange(5)[:,None]])
<__array_function__ internals> in hstack(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/numpy/core/shape_base.py in hstack(tup)
341 # As a special case, dimension 0 of 1-dimensional arrays is "horizontal"
342 if arrs and arrs[0].ndim == 1:
--> 343 return _nx.concatenate(arrs, 0)
344 else:
345 return _nx.concatenate(arrs, 1)
<__array_function__ internals> in concatenate(*args, **kwargs)
ValueError: all the input arrays must have same number of dimensions,
but the array at index 0 has 1 dimension(s) and the array at index 1
has 2 dimension(s)
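A likely reading of this traceback: np.hstack passes every input through np.atleast_1d, and NumPy does not unpack a SciPy sparse matrix into numbers. Instead it wraps the whole matrix as a single Python object, which ends up as a 1-D object array with one element, hence "the array at index 0 has 1 dimension(s)". The sketch below checks this under the NumPy and SciPy versions I am familiar with; the small matrix is a made-up example, and converting with toarray() is shown only as a dense workaround that gives up sparsity.

```python
import numpy as np
from scipy import sparse

# Hypothetical small sparse matrix, shape (5, 4).
M = sparse.coo_matrix(np.eye(5, 4))

# NumPy treats the sparse matrix as one opaque object, not as a 5x4 grid.
wrapped = np.atleast_1d(M)
print(wrapped.dtype, wrapped.shape)

# Converting to a dense ndarray first lets np.hstack work,
# at the cost of materialising every zero in memory.
dense_result = np.hstack([M.toarray(), np.arange(5)[:, None]])
print(dense_result.shape)
```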
Using sparse.hstack correctly:
In [130]: sparse.hstack([M, np.arange(5)[:,None]])
Out[130]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 8 stored elements in COOrdinate format>
In [131]: _.A
Out[131]:
array([[0. , 0. , 0. , 0.00622211, 0. ],
[0.71985591, 0. , 0. , 0. , 1. ],
[0.36039864, 0. , 0. , 0. , 2. ],
[0. , 0. , 0. , 0. , 3. ],
[0. , 0. , 0.95199276, 0. , 4. ]])
If the second array has shape (5,) instead of (5,1), I get your latest error:
In [132]: sparse.hstack([M, np.arange(5)])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-132-defd4158f59e> in <module>
----> 1 sparse.hstack([M, np.arange(5)])
/usr/local/lib/python3.6/dist-packages/scipy/sparse/construct.py in hstack(blocks, format, dtype)
463
464 """
--> 465 return bmat([blocks], format=format, dtype=dtype)
466
467
/usr/local/lib/python3.6/dist-packages/scipy/sparse/construct.py in bmat(blocks, format, dtype)
584 exp=brow_lengths[i],
585 got=A.shape[0]))
--> 586 raise ValueError(msg)
587
588 if bcol_lengths[j] == 0:
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 1, expected 5.
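Since both failures above come down to the shape of the second block, one option is to check it before stacking. The helper below is a hypothetical sketch, not part of NumPy or SciPy: append_column is a name I made up, and it assumes the extra feature arrives as a 1-D array with one value per row.

```python
import numpy as np
from scipy import sparse


def append_column(matrix, values):
    """Hypothetical helper: append a 1-D array as one extra column of a sparse matrix."""
    values = np.asarray(values)
    if values.ndim != 1:
        raise ValueError(f"expected a 1-D array, got shape {values.shape}")
    if values.shape[0] != matrix.shape[0]:
        raise ValueError(
            f"expected {matrix.shape[0]} values, got {values.shape[0]}"
        )
    # values[:, None] has shape (n, 1), which is what sparse.hstack needs.
    return sparse.hstack([matrix, values[:, None]], format="csr")


# Made-up 5x4 sparse matrix for demonstration.
M = sparse.csr_matrix(np.eye(5, 4))
result = append_column(M, np.arange(5))
print(result.shape)
```

With a mismatched length, for example append_column(M, np.arange(4)), the helper raises a ValueError that names both row counts, which is easier to read than the bmat message above.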