为什么 scipy.sparse.csc_matrix.sum() 的结果将其类型更改为 numpy 矩阵？

Question

我想生成一个大的稀疏矩阵并对其求和，但我经常遇到 MemoryError。所以我尝试通过 scipy.sparse.csc_matrix.sum 进行操作，但发现在求和后数据类型变回了 numpy matrix。

window = 10    
np.random.seed = 0
mat = sparse.csc_matrix(np.random.rand(100, 120)>0.5, dtype='d')
print type(mat)
>>> <class 'scipy.sparse.csc.csc_matrix'>

mat_head = mat[:,0:window].sum(axis=1)
print type(mat_head)
>>> <class 'numpy.matrixlib.defmatrix.matrix'>

所以我生成 mat 作为零矩阵只是为了测试当 mat_head 全部为零时的结果。

mat = sparse.csc_matrix((100,120))
print type(mat)
>>> <class 'scipy.sparse.csc.csc_matrix'>
mat_head = mat.sum(axis=1)
print type(mat_head)
>>> <class 'numpy.matrixlib.defmatrix.matrix'>
print np.count_nonzero(mat_head)
>>> 0

为什么会这样？因此，通过 scipy.sparse 求和并没有比 numpy 更利于保存内存，因为它们无论如何都会将数据类型改回来？

Answer 1

就本质上是设计选择的硬性理由而言，我会提出以下论点：

csr 和 csc 格式专为稀疏但不是非常稀疏的矩阵而设计。特别是，对于一个 nxn 矩阵，它的非零值明显少于 n 个，这些格式相当浪费，因为在数据和索引之上，它们带有一个大小为 n+1 的字段 indptr（描述行或列）。

因此，假设正确使用 csc 或 csr 矩阵，可以合理地期望行或列总和不稀疏，相应的方法应该 return 密集向量。

Answer 2

csr 和 csc 格式是为线性代数开发的，尤其是大型但稀疏的线性方程的解

A*x = b
x = b/A

A 必须是可逆的，不能有全 0 的行或列。

A.sum(1) 是通过矩阵乘法完成的，有一个 1 的 (n,1) 数组。

你的 mat:

In [203]: np.allclose(mat*np.mat(np.ones((120,1))), mat.sum(1))
Out[203]: True

我自己这样做实际上要快一点（某处开销？）

In [204]: timeit mat.sum(1)
92.7 µs ± 111 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [205]: timeit mat*np.mat(np.ones((120,1)))
59.2 µs ± 53.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

我也可以用稀疏矩阵来做到这一点：

In [209]: mat*sparse.csc_matrix(np.ones((120,1)))
Out[209]: 
<100x1 sparse matrix of type '<class 'numpy.float64'>'
    with 100 stored elements in Compressed Sparse Column format>
In [211]: np.allclose(mat.sum(1),_.todense())
Out[211]: True

但是时间比较慢，即使我把稀疏创建移到循环外：

In [213]: %%timeit I=sparse.csc_matrix(np.ones((120,1)))
     ...: mat*I
     ...: 
215 µs ± 401 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

如果 mat 是 (115100,10) 并且有很多全 0 行，这种全稀疏方法可以节省时间和 space。

mat[:,:10] 也使用矩阵乘法，使用稀疏提取器矩阵执行。

其实比行总和慢：

In [247]: timeit mat[:,:10]
305 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [248]: timeit mat[:,:10].sum(1)
384 µs ± 9.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

我可以将列选择与求和结合使用：

In [252]: I = sparse.lil_matrix((120,1),dtype=int); I[:10,:]=1; I=I.tocsc()
In [253]: I
Out[253]: 
<120x1 sparse matrix of type '<class 'numpy.int64'>'
    with 10 stored elements in Compressed Sparse Column format>
In [254]: np.allclose((mat*I).todense(),mat[:,:10].sum(1))
Out[254]: True

虽然我可以改进 I 构建步骤，但此 mat*I 的时间较慢。

I = sparse.csc_matrix((np.ones(10,int), np.arange(10), np.array([0,10])), shape=(120,1))

Answer 3

我知道你 "why" 的问题主要针对设计决策背后的动机，但无论如何我追踪了 csc_matrix.sum(axis=1) 的结果实际上是如何变成一个 numpy matrix .

csc_matrix class inherits from the _cs_matrix class which inherits from the _data_matrix class which inherits from the spmatrix base class. This last one implements .sum(ax) as

if axis == 0:
    # sum over columns
    ret = np.asmatrix(np.ones(
        (1, m), dtype=res_dtype)) * self
else:
    # sum over rows
    ret = self * np.asmatrix(np.ones((n, 1), dtype=res_dtype))

换句话说，as also noted in a comment，column/row 和分别与密集的行或列矩阵相乘计算得出。此操作的结果将是您在输出中看到的密集矩阵。

虽然一些子类重写了它们的 .sum() 方法，但据我所知，这只发生在 axis=None 情况下，所以你看到的结果可以归因于上面的块代码。

为什么 scipy.sparse.csc_matrix.sum() 的结果将其类型更改为 numpy 矩阵？

Why does the result of scipy.sparse.csc_matrix.sum() change its type to numpy matrix?

python

numpy

scipy

sparse-matrix

python-2.7