高效切片三角稀疏矩阵

Question

我有一个稀疏的三角矩阵（例如距离矩阵）。实际上，这将是一个 > 1M x 1M 的距离矩阵，具有高稀疏性。

from scipy.sparse import csr_matrix
X = csr_matrix([
      [1, 2, 3, 3, 1],
      [0, 1, 3, 3, 2],
      [0, 0, 1, 1, 3],
      [0, 0, 0, 1, 3],
      [0, 0, 0, 0, 1],
])

我想将这个矩阵子集化为另一个三角距离矩阵。索引的顺序可能不同 and/or 重复。

idx = np.matrix([1,2,4,2])
X2 = X[idx.T, idx]

这可能导致生成的矩阵不是三角形的，其中缺少一些值上三角，一些值在下三角中重复。

>>> X2.toarray()
array([[1, 3, 2, 3],
       [0, 1, 3, 1],
       [0, 0, 1, 0],
       [0, 1, 3, 1]])

如何尽可能高效地得到正确的上三角矩阵？目前，我在子集化之前镜像矩阵，然后将其子集化到三角形，但这感觉不是特别有效，因为它至少需要复制所有条目。

# use transpose method, see 
X = X + X.T - scipy.sparse.diags(X.diagonal())
X2 = X[idx.T, idx]
X2 = scipy.sparse.triu(X2, k=0, format="csr")

>>> X2.toarray()
array([[1., 3., 2., 3.],
       [0., 1., 3., 1.],
       [0., 0., 1., 3.],
       [0., 0., 0., 1.]])

Answer 1

好吧，我无法将其本地化到 triu，但这应该更快：

idx = np.array([1,2,4,2])
i = np.stack(np.meshgrid(idx, idx))
X2 = X[i.min(0), i.max(0)]
 
array([[1, 3, 2, 3],
       [3, 1, 3, 1],
       [2, 3, 1, 3],
       [3, 1, 3, 1]])

所以整个过程是：

idx = np.array([1,2,4,2])
i = np.stack(np.meshgrid(idx, idx))
X2 = scipy.sparse.triu(X[i.min(0), i.max(0)], k=0, format="csr")

但我无法摆脱必须有更优化的方式的感觉。

Answer 2

这里有一个不涉及镜像数据的方法，而是通过操作稀疏索引来达到预期的结果：

import scipy.sparse as sp

X2 = X[idx.T, idx]

# Extract indices and data (this is essentially COO format)
i, j, data = sp.find(X2)

# Generate indices with elements moved to upper triangle
ij = np.vstack([
  np.where(i > j, j, i),
  np.where(i > j, i, j)
])

# Remove duplicate elements
ij, ind = np.unique(ij, axis=1, return_index=True)

# Re-build the matrix
X2 = sp.coo_matrix((data[ind], ij)).tocsr()

Answer 3

这不是改进的工作答案，而是对稀疏索引和 triu 功能的探索。它可能会为您提供进行更直接计算的想法。事实上，你从一个 tri 开始，并期待一个 tri，这意味着这不是一个微不足道的任务，即使是密集数组（它的索引速度要快得多）也是如此。

sparse.csr 索引使用矩阵乘法。我将用密集数组来说明这一点：

In [304]: X = np.array([
     ...:       [1, 2, 3, 3, 1],
     ...:       [0, 1, 3, 3, 2],
     ...:       [0, 0, 1, 1, 3],
     ...:       [0, 0, 0, 1, 3],
     ...:       [0, 0, 0, 0, 1],
     ...: ])
In [305]: idx = np.array([1,2,4,2])
In [306]: X[idx[:,None],idx]
Out[306]: 
array([[1, 3, 2, 3],
       [0, 1, 3, 1],
       [0, 0, 1, 0],
       [0, 1, 3, 1]])
In [307]: m = np.array([[0,1,0,0,0],[0,0,1,0,0],[0,0,0,0,1],[0,0,1,0,0]])
In [308]: m@X@m.T
Out[308]: 
array([[1, 3, 2, 3],
       [0, 1, 3, 1],
       [0, 0, 1, 0],
       [0, 1, 3, 1]])

和全距离阵列：

In [309]: X2 = X+X.T-np.diag(np.diag(X))
In [311]: X2[idx[:,None],idx]
Out[311]: 
array([[1, 3, 2, 3],
       [3, 1, 3, 1],
       [2, 3, 1, 3],
       [3, 1, 3, 1]])
In [312]: m@X2@m.T
Out[312]: 
array([[1, 3, 2, 3],
       [3, 1, 3, 1],
       [2, 3, 1, 3],
       [3, 1, 3, 1]])

我不知道是否可以直接从 X（或 X2）

In [316]: sparse.triu(Out[312])
Out[316]: 
<4x4 sparse matrix of type '<class 'numpy.int64'>'
    with 10 stored elements in COOrdinate format>
In [317]: _.A
Out[317]: 
array([[1, 3, 2, 3],
       [0, 1, 3, 1],
       [0, 0, 1, 3],
       [0, 0, 0, 1]])

sparse.triu 是：

In [331]: A = sparse.coo_matrix(_312)
     ...: mask = A.row <= A.col 
In [332]: A
Out[332]: 
<4x4 sparse matrix of type '<class 'numpy.int64'>'
    with 16 stored elements in COOrdinate format>
In [333]: mask
Out[333]: 
array([ True,  True,  True,  True, False,  True,  True,  True, False,
       False,  True,  True, False, False, False,  True])

这个mask数组是16项，A.nnz.

然后创建一个新的 coo 矩阵，其中 data/row/col 个数组选自 A 个属性：

In [334]: d=A.data[mask]
In [335]: r=A.row[mask]
In [336]: c=A.col[mask]
In [337]: d
Out[337]: array([1, 3, 2, 3, 1, 3, 1, 1, 3, 1])
In [338]: sparse.coo_matrix((d, (r,c)))
Out[338]: 
<4x4 sparse matrix of type '<class 'numpy.int64'>'
    with 10 stored elements in COOrdinate format>
In [339]: _.A
Out[339]: 
array([[1, 3, 2, 3],
       [0, 1, 3, 1],
       [0, 0, 1, 3],
       [0, 0, 0, 1]])

np.triu 使用 mask 如：

In [349]: np.tri(4,4,-1)
Out[349]: 
array([[0., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 1., 0.]])

Answer 4

为了总结所有优秀的贡献，这个问题的简短答案是：

不要使用三角矩阵。与使用方阵相比，在速度或内存方面没有任何优势。

原因在中有解释：

对稀疏矩阵进行切片使用效率极高的矩阵乘法。将矩阵重新排列成三角形会很慢。
使用 triu 是一个昂贵的操作，因为它实现了一个密集的掩码。

将与仅使用方阵进行比较时，这一点变得很明显。使用正方形更快，占用的内存更少。

测试使用稀疏三角形 800k x 800k 距离矩阵，具有高稀疏性 (%nnz << 1%)。数据和代码可用 here.

# Running benchmark: Converting to square matrix
./benchmark.py squarify   6.29s  user 1.59s system 80% cpu 9.738 total
max memory:                4409 MB

# Running benchmark: @jakevdp's solution
./benchmark.py sparse_triangular   67.03s  user 3.01s system 99% cpu 1:10.15 total
max memory:                5209 MB

如果迫切希望在使用方阵之外对其进行优化，是一个很好的起点：

I would consider implementing this as a compressed distance matrix in the same style that pdist does, but as a 1xN CSR matrix, and then using coordinate math to reindex it when you need to get specific values.

高效切片三角稀疏矩阵

Efficiently slice triangular sparse matrix

python

numpy

scipy

sparse-matrix