How to compute the cosine similarity of a list of scipy.sparse.csr.csr_matrix
I have a list of sparse vectors:
print(type(downsample_matrix)) # Display <class 'list'>
print(type(downsample_matrix[0])) # Display <class 'scipy.sparse.csr.csr_matrix'>
I want to use scikit-learn's cosine_similarity function on downsample_matrix, but I get the following error:
ValueError Traceback (most recent call last)
<ipython-input-27-5997ca6abb2d> in <module>()
19 downsample_matrix.append(vector)
20 downsample_coefficient = 0
---> 21 similarity_matrix = cosine_similarity(downsample_matrix)
22 plt.matshow(similarity_matrix)
23 plt.show()
/home/venv/lib/python3.5/site-packages/sklearn/metrics/pairwise.py in cosine_similarity(X, Y, dense_output)
908 # to avoid recursive import
909
--> 910 X, Y = check_pairwise_arrays(X, Y)
911
912 X_normalized = normalize(X, copy=True)
/home/venv/lib/python3.5/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype)
104 if Y is X or Y is None:
105 X = Y = check_array(X, accept_sparse='csr', dtype=dtype,
--> 106 warn_on_dtype=warn_on_dtype, estimator=estimator)
107 else:
108 X = check_array(X, accept_sparse='csr', dtype=dtype,
/home/venv/lib/python3.5/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
380 force_all_finite)
381 else:
--> 382 array = np.array(array, dtype=dtype, order=order, copy=copy)
383
384 if ensure_2d:
ValueError: setting an array element with a sequence.
When my list is made up of numpy.ndarray objects, I have no problem:
print(type(downsample_matrix)) # Display <class 'list'>
print(type(downsample_matrix[0])) # Display <class 'numpy.ndarray'>
How can I apply cosine_similarity to my list of sparse vectors?
Try this:
from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse
import numpy as np
A = np.array([[0, 1, 2, 0, 0], [0, 0, 1, 1, 2],[0, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)
similarities = cosine_similarity(A_sparse)
print('pairwise dense output:\n {}\n'.format(similarities))
Result:
pairwise dense output:
[[ 1. 0.36514837 0.31622777]
[ 0.36514837 1. 0.28867513]
[ 0.31622777 0.28867513 1. ]]
My scipy version:
import scipy
print(scipy.__version__)
0.19.0
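As a cross-check on the result above, the same numbers can be reproduced by hand: L2-normalize each row, then take the matrix product of the normalized matrix with its transpose. This is only a minimal sketch of the idea; the names A_norm and manual are mine, and it is not meant as a description of everything cosine_similarity does internally.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

A = np.array([[0, 1, 2, 0, 0], [0, 0, 1, 1, 2], [0, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)

# Scale every row to unit L2 norm; the result stays sparse.
A_norm = normalize(A_sparse, norm="l2", axis=1)

# Dot products of unit-length rows are cosine similarities.
manual = (A_norm @ A_norm.T).toarray()

# Compare against the library function.
print(np.allclose(manual, cosine_similarity(A_sparse)))
```

Everything stays sparse until the final toarray() call, which is the point of passing a csr_matrix rather than a dense array.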
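Make a small sparse matrix. Note that it is not a subclass of ndarray. It stores its contents in 3 arrays: the data, the column indices, and the row pointers (indptr):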
In [196]: M = sparse.csr_matrix([[0,1,0],[1,0,1]])
In [197]: M
Out[197]:
<2x3 sparse matrix of type '<class 'numpy.int32'>'
with 3 stored elements in Compressed Sparse Row format>
In [198]: M.data
Out[198]: array([1, 1, 1], dtype=int32)
In [199]: M.indices
Out[199]: array([1, 0, 2], dtype=int32)
In [200]: M.indptr
Out[200]: array([0, 1, 3], dtype=int32)
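To make the role of the three arrays concrete, here is a small sketch that rebuilds the dense rows from data, indices and indptr. The helper csr_rows is hypothetical (it is not part of scipy) and is for illustration only; in practice M.toarray() does this for you.

```python
import numpy as np
from scipy import sparse

M = sparse.csr_matrix([[0, 1, 0], [1, 0, 1]])

def csr_rows(mat):
    """Rebuild dense rows from the three CSR arrays (illustrative helper)."""
    rows = []
    for i in range(mat.shape[0]):
        # indptr[i]:indptr[i + 1] is the slice of data/indices that belongs to row i.
        start, end = mat.indptr[i], mat.indptr[i + 1]
        row = np.zeros(mat.shape[1], dtype=mat.dtype)
        # indices holds the column of each stored value in that slice.
        row[mat.indices[start:end]] = mat.data[start:end]
        rows.append(row)
    return np.array(rows)

rebuilt = csr_rows(M)
print(rebuilt)
```

Row 0 uses the slice 0:1 (one value in column 1) and row 1 uses the slice 1:3 (values in columns 0 and 2), which matches the indptr output above.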
If I try to make an array from a list of these matrices, I get an object dtype array with 3 elements (pointers to this matrix):
In [201]: alist = [M,M,M]
In [202]: np.array(alist)
Out[202]: /usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:294: SparseEfficiencyWarning: Comparing sparse matrices using >= and <= is inefficient, using <, >, or !=, instead.
"using <, >, or !=, instead.", SparseEfficiencyWarning)
array([ <2x3 sparse matrix of type '<class 'numpy.int32'>'
with 3 stored elements in Compressed Sparse Row format>,
<2x3 sparse matrix of type '<class 'numpy.int32'>'
with 3 stored elements in Compressed Sparse Row format>,
<2x3 sparse matrix of type '<class 'numpy.int32'>'
with 3 stored elements in Compressed Sparse Row format>], dtype=object)
If I additionally specify a dtype, I get your error:
In [203]: np.array(alist,dtype=int)
...
ValueError: setting an array element with a sequence.
There is no way to convert the list into a numeric array.
But if it is a list of dense arrays, I get a 3d array:
In [204]: np.array([M.A,M.A,M.A],dtype=int)
Out[204]:
array([[[0, 1, 0],
[1, 0, 1]],
[[0, 1, 0],
[1, 0, 1]],
[[0, 1, 0],
[1, 0, 1]]])
In [205]: _.shape
Out[205]: (3, 2, 3)
I can also join the sparse matrices with the sparse versions of vstack or hstack:
In [206]: sparse.vstack(alist)
Out[206]:
<6x3 sparse matrix of type '<class 'numpy.int32'>'
with 9 stored elements in Compressed Sparse Row format>
In [207]: _.A
Out[207]:
array([[0, 1, 0],
[1, 0, 1],
[0, 1, 0],
[1, 0, 1],
[0, 1, 0],
[1, 0, 1]], dtype=int32)
Note the shape (6,3). Sparse matrices are always 2d.
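sparse.vstack passes the task to sparse.bmat, which constructs a new sparse matrix from 'blocks'. It does so by concatenating the coo representations of the blocks with the appropriate offsets.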
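The offset idea can be sketched roughly like this. The function vstack_via_coo is a hypothetical helper of mine that only illustrates the technique; it is not scipy's actual bmat code, and it assumes every block has the same number of columns.

```python
import numpy as np
from scipy import sparse

def vstack_via_coo(blocks):
    """Stack sparse blocks vertically by shifting their coo row indices."""
    rows, cols, data = [], [], []
    offset = 0
    n_cols = blocks[0].shape[1]  # assumption: all blocks share this width
    for block in blocks:
        coo = block.tocoo()
        rows.append(coo.row + offset)  # shift rows below the previous blocks
        cols.append(coo.col)           # columns are unchanged for a vertical stack
        data.append(coo.data)
        offset += coo.shape[0]
    stacked = sparse.coo_matrix(
        (np.concatenate(data), (np.concatenate(rows), np.concatenate(cols))),
        shape=(offset, n_cols),
    )
    return stacked.tocsr()

M = sparse.csr_matrix([[0, 1, 0], [1, 0, 1]])
alist = [M, M, M]
stacked = vstack_via_coo(alist)
print(stacked.shape)
```

For real work, sparse.vstack is the call to use; the sketch is only meant to show why the result is a single 2d matrix and not a stack of separate ones.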
Since cosine_similarity needs a 2d array or a sparse matrix, you have to use sparse.vstack to join the matrices, or reshape the result of the 3d array join:
In [212]: cosine_similarity(sparse.vstack(alist))
Out[212]:
array([[ 1., 0., 1., 0., 1., 0.],
[ 0., 1., 0., 1., 0., 1.],
[ 1., 0., 1., 0., 1., 0.],
[ 0., 1., 0., 1., 0., 1.],
[ 1., 0., 1., 0., 1., 0.],
[ 0., 1., 0., 1., 0., 1.]])
In [213]: cosine_similarity( np.array([M.A,M.A,M.A],dtype=int).reshape(-1,3))
Out[213]:
array([[ 1., 0., 1., 0., 1., 0.],
[ 0., 1., 0., 1., 0., 1.],
[ 1., 0., 1., 0., 1., 0.],
[ 0., 1., 0., 1., 0., 1.],
[ 1., 0., 1., 0., 1., 0.],
[ 0., 1., 0., 1., 0., 1.]])
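Applied to the question, that would look something like the sketch below. The contents of downsample_matrix here are made up, and I am assuming each element is a 1 x n_features csr_matrix with the same number of columns; adjust to your real data.

```python
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the asker's list: one sparse row vector per item.
downsample_matrix = [
    sparse.csr_matrix([[0, 1, 2, 0, 0]]),
    sparse.csr_matrix([[0, 0, 1, 1, 2]]),
    sparse.csr_matrix([[0, 1, 0, 1, 0]]),
]

# Join the list into a single 2d sparse matrix, one row per vector.
stacked = sparse.vstack(downsample_matrix, format="csr")

# cosine_similarity accepts the sparse matrix directly and returns a dense array.
similarity_matrix = cosine_similarity(stacked)
print(similarity_matrix.shape)
```

The resulting similarity_matrix is an ordinary ndarray, so it can be passed straight to plt.matshow as in the original code.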