Efficient groupby of a sparse matrix in Python

Question

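I have a very sparse matrix with about 10M rows (N) and about 1K columns (P). Each row has exactly one non-empty value, so there are N entries. I also have an array of 10M non-unique values; these are the "keys" I want to group my sparse matrix by. Let K denote the number of unique keys (~200K).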

The result should be a sparse matrix with K rows, P columns, and N entries.

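Applying the pandas groupby function performed poorly for me, and its output is not sparse. I also tried itertools.groupby, but the loop took too long. I suspect I am doing something wrong.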

Here is code that reproduces my problem:

import numpy as np
from scipy.sparse import coo_matrix
N = 10000000
P = 1000
K = 200000
keys = np.random.choice(K, N)
row = [i for i in range(N)]
col = np.random.choice(P, N)
data = np.random.normal(loc = 100, scale = 2, size = N)
mat = coo_matrix((data, (row, col)), shape=(N,P))

I would really appreciate code that returns a matrix/dataframe of dimension K x P with N entries, with the keys as the index.

Many thanks! Alan

Answer 1

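To me this looks like plain matrix multiplication. Note that, because of key collisions (several rows sharing both key and column, whose values are summed into one cell), the resulting matrix will have fewer than 10 million nonzeros. The last bit of code identifies the keys, columns, and multiplicities of these collisions: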

import numpy as np
from scipy.sparse import coo_matrix
N = 10_000_000
P = 1000
K = 200_000
keys = np.random.choice(K, N)
row = [i for i in range(N)]
col = np.random.choice(P, N)
data = np.random.normal(loc = 100, scale = 2, size = N)
mat = coo_matrix((data, (row, col)), shape=(N,P))


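import time
from scipy import sparse

t0 = time.perf_counter()
# km[k, i] == 1 iff keys[i] == k: one CSC column per input row, holding a single 1
km = sparse.csc_matrix((np.ones(N,), keys, np.arange(N+1)), (K, N))
# (K x N) @ (N x P): sums the rows of mat that share a key
result = km @ mat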
t1 = time.perf_counter()
print(f'{t1-t0:.3f} seconds')

print(f'{result!r}')
print()

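# identify collisions
t0 = time.perf_counter()
# indicator matrix: a 1 at (i, col[i]) for every input row i
idctr = sparse.csr_matrix((np.ones((N,), int), col, np.arange(N+1)), (N, P))
# coll[k, p] counts how many input rows landed in cell (k, p); force CSC layout
coll = (km @ idctr).tocsc()
# positions in coll.data whose multiplicity is 2, 3, 4, 5
cidx = *map(np.flatnonzero, map(coll.data.__eq__, range(2, 6))),
# column of each position: the last j with coll.indptr[j] <= position
c2, c3, c4, c5 = (np.searchsorted(coll.indptr, i, 'right') - 1 for i in cidx)
# key (row index) of each position
k2, k3, k4, k5 = map(coll.indices.__getitem__, cidx)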
t1 = time.perf_counter()
print(f'{t1-t0:.3f} seconds')
print(', '.join(map(str, map(len, (k2, k3, k4, k5)))), 'collisions of 2, 3, 4, 5 keys')

Sample output:

3.998 seconds
<200000x1000 sparse matrix of type '<class 'numpy.float64'>'
        with 9754546 stored elements in Compressed Sparse Column format>

4.110 seconds
237476, 3908, 54, 0 collisions of 2, 3, 4, 5 keys
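
To see why the product groups by key, it may help to run the same construction on a toy input. The sketch below uses small, made-up sizes and values (N = 6, P = 3, K = 2; these numbers are illustrative assumptions, not from the benchmark above). It compares the matrix product against a dense accumulation with np.add.at, and also against building the K x P matrix directly in COO format, which is an alternative not used in the answer and relies on the conversion to CSR summing duplicate coordinates.

```python
import numpy as np
from scipy import sparse

# Toy sizes and values, chosen only for illustration
N, P, K = 6, 3, 2
keys = np.array([0, 1, 0, 1, 0, 0])
col = np.array([0, 0, 0, 2, 1, 0])
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
mat = sparse.coo_matrix((data, (np.arange(N), col)), shape=(N, P))

# Selector matrix: column i holds a single 1 in row keys[i]
km = sparse.csc_matrix((np.ones(N), keys, np.arange(N + 1)), shape=(K, N))
grouped = (km @ mat).tocsr()

# Cross-check 1: dense accumulation of data into cell (keys[i], col[i])
expected = np.zeros((K, P))
np.add.at(expected, (keys, col), data)

# Cross-check 2 (alternative, not from the answer): use keys as row indices
# directly; converting COO to CSR sums entries with duplicate coordinates
direct = sparse.coo_matrix((data, (keys, col)), shape=(K, P)).tocsr()

assert np.allclose(grouped.toarray(), expected)
assert np.allclose(direct.toarray(), expected)
print(grouped.toarray())
print(grouped.nnz, "stored entries from", N, "input rows")
```

In this toy input, rows 0, 2, and 5 share key 0 and column 0, so they collapse into a single cell. That is the same effect that brings the stored-element count in the sample output below 10 million.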
