快速查找 Python 中 CSC 矩阵中每一行的非零条目索引

Question

这是当前的实现：

def nonzero_indexes_by_row(input):
    return [
        np.nonzero(row)[1] 
        for row in csr_matrix(input.T)
    ]

矩阵非常大（1.5M，500K），因为我正在访问行，所以我必须先将 CSC 转换为 CSR。结果将是一个二维列表，每个列表包含一个非零索引列表，对应于原始矩阵中的行。

当前过程需要 20 分钟。有没有更快的方法？

Answer 1

您的代码的一个有趣的替代方法是将您的数组转换为 COOrdinate 格式，然后读取其 row 和 col 属性：

def nonzero_indices_by_coo(input):
    cx = input.T.tocoo()
    res = [ [] for i in range(cx.shape[0]) ]
    for i, j in zip(cx.row, cx.col):
        res[i].append(j)
    return res

它 returns 一个普通 pythonic 列表的列表，而不是 Numpy 数组，但这应该不是什么重要的区别。

我注意到您的代码在内部使用源数组的 t运行sposition （T 运算符）所以我在我的代码中做了同样的事情。

为了比较执行速度，我创建了以下稀疏数组（2000 by 300）：

r = 2000; c = 300
x = scipy.sparse.lil_matrix( (r,c) )
for _ in range(r):
    x[np.random.randint(0,r-1), np.random.randint(0,c-1)] = np.random.randint(1,100)

我的代码运行比你的快 12 倍。

更快的解决方案（其他格式）

或者生成一个 2-D (Numpy) 数组可能会更好，有 2 行：

第一行 - 连续非零元素的行索引，
第二行 - 列索引。

要生成这样的结果，您可以使用以下代码：

def nonzero_indices_2d(input):
    cx = input.T.tocoo()
    return np.array([cx.row, cx.col])

运行比我的第一个解决方案快 4 倍。

当然，你的代码的其他部分应该重新编写，以消耗以另一种格式给出的索引。

稀疏数组也有其 own nonzero 方法：

arr.nonzero()

生成 2 行 Numpy 索引数组。这个函数运行s 还没有快了几个百分点。

所以，假设 2-D 结果格式是可以接受的（而不是列表的列表），也许你不需要任何自己的函数来获取这些指数。

另一个需要考虑的细节：是否（在所有版本中）应该有使用 t运行位置。您的选择，但没有 t运行sposition 每个版本代码会运行快一点。

Answer 2

当然可以。您非常接近理想的解决方案，但您正在分配一些不必要的数组。这是一个更快的方法：

from scipy import sparse
import numpy as np

def my_impl(csc):
    csr = csc.tocsr()
    return np.split(csr.indices, csr.indptr[1:-1])

def your_impl(input):
    return [
        np.nonzero(row)[1] 
        for row in sparse.csr_matrix(input)
    ]

## Results

# demo data
csc = sparse.random(15000, 5000, format="csc")

your_result = your_impl(csc)
my_result = my_impl(csc)

## Tests for correctness

# Same result
assert all(np.array_equal(x, y) for x, y in zip(your_result, my_result))
# Right number of rows
assert len(my_result) == csc.shape[0]

## Speed

%timeit my_impl(csc)
# 31 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit your_impl(csc)
# 1.49 s ± 19.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

附带问题，为什么要转置矩阵？那么你不会得到列的非零条目吗？如果那是你想要的，你甚至不需要转换为 csr，只需运行:

np.split(csc.indices, csc.indptr[1:-1])

快速查找 Python 中 CSC 矩阵中每一行的非零条目索引

Fast way to find indexes of nonzero entries for every row in a CSC matrix in Python

python

numpy

sparse-matrix

更快的解决方案（其他格式）