为什么 scipy.sparse.csc_matrix 不保留我的 np.array 的索引顺序?
Why doesn't scipy.sparse.csc_matrix preserve the indexing order of my np.array?
我正在编写代码以同时从多个大型并行 scipy sparse.csc
矩阵中删除多个列(这意味着所有矩阵都具有相同的暗淡,并且所有 nnz 元素都在相同的位置)并且有效率的。我这样做是通过仅索引到我想为一个矩阵保留的列,然后为其他矩阵重用索引和 indptr 列表。但是,当我按列表索引 csc 矩阵时,它会重新排序数据列表,因此我无法重用索引。有没有办法强制 scipy 保持数据列表的原始顺序?为什么只有在按列表索引时才重新排序?
import scipy.sparse
import numpy as np
mat = scipy.sparse.csc_matrix(np.array([[1,0,0,0,2,5],
[1,0,1,0,0,0],
[0,0,0,4,0,1],
[0,3,0,1,0,4]]))
print mat[:,3].data
returns 数组([4, 1])
print mat[:,[3]].data
returns 数组([1, 4])
In [43]: mat = sparse.csc_matrix(np.array([[1,0,0,0,2,5],[1,0,1,0,0,0],[0,0,0,4,
...: 0,1],[0,3,0,1,0,4]]))
...:
...:
In [44]: mat
Out[44]:
<4x6 sparse matrix of type '<class 'numpy.int64'>'
with 10 stored elements in Compressed Sparse Column format>
In [45]: mat.data
Out[45]: array([1, 1, 3, 1, 4, 1, 2, 5, 1, 4], dtype=int64)
In [46]: mat.indices
Out[46]: array([0, 1, 3, 1, 2, 3, 0, 0, 2, 3], dtype=int32)
In [47]: mat.indptr
Out[47]: array([ 0, 2, 3, 4, 6, 7, 10], dtype=int32)
标量选择:
In [48]: m1 = mat[:,3]
In [49]: m1
Out[49]:
<4x1 sparse matrix of type '<class 'numpy.int64'>'
with 2 stored elements in Compressed Sparse Column format>
In [50]: m1.data
Out[50]: array([4, 1])
In [51]: m1.indices
Out[51]: array([2, 3], dtype=int32)
In [52]: m1.indptr
Out[52]: array([0, 2], dtype=int32)
列表索引:
In [53]: m2 = mat[:,[3]]
In [54]: m2.data
Out[54]: array([1, 4], dtype=int64)
In [55]: m2.indices
Out[55]: array([3, 2], dtype=int32)
In [56]: m2.indptr
Out[56]: array([0, 2], dtype=int32)
排序:
In [57]: m2.sort_indices()
In [58]: m2.data
Out[58]: array([4, 1], dtype=int64)
In [59]: m2.indices
Out[59]: array([2, 3], dtype=int32)
csc indexing with a list uses matrix multiplication. It constructs an extractor matrix based on the index, and then does the dot multiply. So it's a brand new sparse matrix; not just a subset of the csc data and index attributes.
csc matrices have a method to ensure the indicies values are ordered (within a column). Applying that might help to ensure the arrays are sorted in the same way.
我正在编写代码以同时从多个大型并行 scipy sparse.csc
矩阵中删除多个列(这意味着所有矩阵都具有相同的暗淡,并且所有 nnz 元素都在相同的位置)并且有效率的。我这样做是通过仅索引到我想为一个矩阵保留的列,然后为其他矩阵重用索引和 indptr 列表。但是,当我按列表索引 csc 矩阵时,它会重新排序数据列表,因此我无法重用索引。有没有办法强制 scipy 保持数据列表的原始顺序?为什么只有在按列表索引时才重新排序?
import scipy.sparse
import numpy as np
mat = scipy.sparse.csc_matrix(np.array([[1,0,0,0,2,5],
[1,0,1,0,0,0],
[0,0,0,4,0,1],
[0,3,0,1,0,4]]))
print mat[:,3].data
returns 数组([4, 1])
print mat[:,[3]].data
returns 数组([1, 4])
In [43]: mat = sparse.csc_matrix(np.array([[1,0,0,0,2,5],[1,0,1,0,0,0],[0,0,0,4,
...: 0,1],[0,3,0,1,0,4]]))
...:
...:
In [44]: mat
Out[44]:
<4x6 sparse matrix of type '<class 'numpy.int64'>'
with 10 stored elements in Compressed Sparse Column format>
In [45]: mat.data
Out[45]: array([1, 1, 3, 1, 4, 1, 2, 5, 1, 4], dtype=int64)
In [46]: mat.indices
Out[46]: array([0, 1, 3, 1, 2, 3, 0, 0, 2, 3], dtype=int32)
In [47]: mat.indptr
Out[47]: array([ 0, 2, 3, 4, 6, 7, 10], dtype=int32)
标量选择:
In [48]: m1 = mat[:,3]
In [49]: m1
Out[49]:
<4x1 sparse matrix of type '<class 'numpy.int64'>'
with 2 stored elements in Compressed Sparse Column format>
In [50]: m1.data
Out[50]: array([4, 1])
In [51]: m1.indices
Out[51]: array([2, 3], dtype=int32)
In [52]: m1.indptr
Out[52]: array([0, 2], dtype=int32)
列表索引:
In [53]: m2 = mat[:,[3]]
In [54]: m2.data
Out[54]: array([1, 4], dtype=int64)
In [55]: m2.indices
Out[55]: array([3, 2], dtype=int32)
In [56]: m2.indptr
Out[56]: array([0, 2], dtype=int32)
排序:
In [57]: m2.sort_indices()
In [58]: m2.data
Out[58]: array([4, 1], dtype=int64)
In [59]: m2.indices
Out[59]: array([2, 3], dtype=int32)
csc indexing with a list uses matrix multiplication. It constructs an extractor matrix based on the index, and then does the dot multiply. So it's a brand new sparse matrix; not just a subset of the csc data and index attributes.
csc matrices have a method to ensure the indicies values are ordered (within a column). Applying that might help to ensure the arrays are sorted in the same way.