Python：由于重复索引，在稀疏矩阵 (lil_matrix) 中累积插入值

Question

我的情况如下：

我有一系列结果，比方说 S = np.array([2,3,10,-1,12,1,2,4,4])，我想根据可能重复元素（没有特定模式）的列索引数组插入 scipy.sparse.lil_matrix M 的最后一行，例如： j = np.array([3,4,5,14,15,16,3,4,5]).

当列索引重复时，应将其在S中对应值的总和插入到矩阵M中。因此，在上面的示例中，结果 [4,7,14] 应该放在 M 最后一行的第 [3,4,5] 列中。换句话说，我想实现这样的目标：

M[-1,j] = np.array([2+2,3+4,10+4,-1,12,1]).

计算速度对我的程序来说非常重要，所以我应该避免使用循环。期待您的巧妙解决方案！谢谢！

Answer 1

您可以使用 defaultdict 将 M 列索引映射到它们的值，并使用 map 函数更新此 defaultdict，如下所示：

from collections import defaultdict

d = defaultdict(int) #Use your array type here
def f(j, s):
    d[j] += s
map(f, j, S)
M[-1, d.keys()] = d.values() #keys and values are always in the same order

如果不想无用地创建 None 的列表，可以使用 filter 而不是 map：

d = defaultdict(int) #Use your array type here
def g(e):
    d[e[1]] += S[e[0]]
filter(g, enumerate(j))
M[-1, d.keys()] = d.values() #keys and values are always in the same

Answer 2

这种求和是 sparse 矩阵的正常行为，尤其是在 csr 格式中。

定义 3 个输入数组：

In [408]: S = np.array([2,3,10,-1,12,1,2,4,4])
In [409]: j=np.array([3,4,5,14,15,16,3,4,5])
In [410]: i=np.ones(S.shape,int)

coo 格式采用这 3 个数组，没有改变

In [411]: c0=sparse.coo_matrix((S,(i,j)))
In [412]: c0.data
Out[412]: array([ 2,  3, 10, -1, 12,  1,  2,  4,  4])

但是当转换为csr格式时，它会对重复的索引求和：

In [413]: c1=c0.tocsr()
In [414]: c1.data
Out[414]: array([ 4,  7, 14, -1, 12,  1], dtype=int32)
In [415]: c1.A
Out[415]: 
array([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  4,  7, 14,  0,  0,  0,  0,  0,  0,  0,  0, -1, 12,  1]], dtype=int32)

将 coo 转换为密集或数组 c0.A 时也会进行求和。

并且在转换为 lil 时：

In [419]: cl=c0.tolil()
In [420]: cl.data
Out[420]: array([[], [4, 7, 14, -1, 12, 1]], dtype=object)
In [421]: cl.rows
Out[421]: array([[], [3, 4, 5, 14, 15, 16]], dtype=object)

lil_matrix 不直接接受 (data,(i,j)) 输入，所以你必须通过 coo 如果那是你的目标。

http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.sparse.coo_matrix.html

By default when converting to CSR or CSC format, duplicate (i,j) entries will be summed together. This facilitates efficient construction of finite element matrices and the like. (see example)

要在现有 lil 中插入，请使用中间 csr:

In [443]: L=sparse.lil_matrix((3,17),dtype=S.dtype)
In [444]: L[-1,:]=sparse.csr_matrix((S,(np.zeros(S.shape),j)))
In [445]: L.A
Out[445]: 
array([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  4,  7, 14,  0,  0,  0,  0,  0,  0,  0,  0, -1, 12,  1]])

这个语句比使用 csr_matrix 的语句快；

L[-1,:]=sparse.coo_matrix((S,(np.zeros(S.shape),j)))

如果您真的担心速度，请检查 L.__setitem__。副手看起来它通常将稀疏矩阵转换为数组

L[-1,:]=sparse.coo_matrix((S,(np.zeros(S.shape),j))).A

花费相同的时间。对于像这样的小测试用例，创建中间矩阵的开销可能会淹没任何花在添加这些重复索引上的时间。

一般来说，向现有稀疏矩阵插入或追加值很慢，无论您是否进行求和。在可能的情况下，最好先为整个矩阵创建 data、i 和 j 数组，然后再创建稀疏矩阵。

Python：由于重复索引，在稀疏矩阵 (lil_matrix) 中累积插入值

Python: Cumulative insertion of values in a sparse matrix (lil_matrix) due to repeated indices

python

arrays

numpy

scipy