Group by on a scipy sparse matrix

Question

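I have a scipy sparse matrix with 10e6 rows and 10e3 columns, populated to 1%. I also have an array of size 10e6 that contains the keys corresponding to the 10e6 rows of my sparse matrix. I want to group my sparse matrix by these keys and aggregate with a sum function.

Example: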

Keys:
['foo','bar','foo','baz','baz','bar']

Sparse matrix:
(0,1) 3              -> corresponds to the first 'foo' key
(0,10) 4             -> corresponds to the first 'bar' key
(2,1) 1              -> corresponds to the second 'foo' key
(1,3) 2              -> corresponds to the first 'baz' key
(2,3) 10             -> corresponds to the second 'baz' key
(2,4) 1              -> corresponds to the second 'bar' key

Expected result:
{
    'foo': {1: 4},               -> 4 = 3 + 1
    'bar': {4: 1, 10: 4},        
    'baz': {3: 12}               -> 12 = 2 + 10
}

What would be the most efficient way to do this?

I already tried pandas.SparseSeries.from_coo on my sparse matrix so that I could use the pandas group-by, but I hit this known bug:

site-packages/pandas/tools/merge.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy)
    863         for obj in objs:
    864             if not isinstance(obj, NDFrame):
--> 865                 raise TypeError("cannot concatenate a non-NDFrame object")
    866 
    867             # consolidate

TypeError: cannot concatenate a non-NDFrame object

Answer 1

I can generate your target with basic dictionary and list operations:

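keys = ['foo', 'bar', 'foo', 'baz', 'baz', 'bar']
# COO-style triplets: keys[i] labels the i-th (row, col, value) entry
rows = [0, 0, 2, 1, 2, 2]
cols = [1, 10, 1, 3, 3, 4]
data = [3, 4, 1, 2, 10, 1]
dd = {}
for i, k in enumerate(keys):
    d1 = dd.get(k, {})          # per-key dict of column -> running sum
    v = d1.get(cols[i], 0)
    d1[cols[i]] = v + data[i]
    dd[k] = d1
print(dd)

which produces (on Python 3.7+, where dicts keep insertion order):

{'foo': {1: 4}, 'bar': {10: 4, 4: 1}, 'baz': {3: 12}}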
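
The same aggregation can be wrapped in a small function. The following is only a sketch of an equivalent formulation, not a faster algorithm: the name group_entries is hypothetical, and it assumes, as the loop above does, that keys[i] labels the i-th stored entry rather than the i-th matrix row.

```python
from collections import defaultdict

def group_entries(keys, cols, data):
    """Sum values per (key, column); keys[i] labels the i-th stored entry."""
    grouped = defaultdict(lambda: defaultdict(int))
    for key, col, value in zip(keys, cols, data):
        grouped[key][col] += value
    # Convert to plain dicts so the result prints and compares cleanly.
    return {key: dict(inner) for key, inner in grouped.items()}

keys = ['foo', 'bar', 'foo', 'baz', 'baz', 'bar']
cols = [1, 10, 1, 3, 3, 4]
data = [3, 4, 1, 2, 10, 1]
result = group_entries(keys, cols, data)
print(result)
```

Using defaultdict removes the explicit get-and-reassign steps, but the work is still one Python-level iteration per stored entry.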

I can also generate a sparse matrix from this data:

import numpy as np
from scipy import sparse
# rows, cols and data are the lists defined above
M = sparse.coo_matrix((data, (rows, cols)))
print(M)
print()
Md = M.todok()   # same values, dictionary-of-keys format
print(Md)

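But note that the order of the terms is not fixed. In coo format the order is the same as the input, but changing the format changes the order. In other words, the matching between keys and the elements of the sparse matrix is unspecified. The first printout below is the coo matrix, the second is the dok matrix: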

  (0, 1)    3
  (0, 10)   4
  (2, 1)    1
  (1, 3)    2
  (2, 3)    10
  (2, 4)    1

  (0, 1)    3
  (1, 3)    2
  (2, 1)    1
  (2, 3)    10
  (0, 10)   4
  (2, 4)    1
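
The ordering point can also be checked on the stored coordinates instead of the printed form. This is a minimal sketch that round-trips through CSR rather than DOK, which is a substitution made here for illustration; the variable names are assumptions.

```python
from scipy import sparse

rows = [0, 0, 2, 1, 2, 2]
cols = [1, 10, 1, 3, 3, 4]
data = [3, 4, 1, 2, 10, 1]

# A freshly built COO matrix stores its entries in the order they were given.
M = sparse.coo_matrix((data, (rows, cols)))
coo_order = list(zip(M.row.tolist(), M.col.tolist()))

# Converting to CSR and back regroups the entries row by row.
R = M.tocsr().tocoo()
csr_order = list(zip(R.row.tolist(), R.col.tolist()))

print(coo_order)
print(csr_order)
```

After the round trip, position i no longer refers to the same entry, so a key list aligned with the original input order cannot be reused as is.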

Until you clarify this mapping, the initial dictionary approach is the best option.
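
If the keys do turn out to label matrix rows one-to-one, as the prose of the question states, one possible vectorized route is to multiply by a group-indicator matrix. This is not part of the answer above, only a sketch under that assumption. The data is hypothetical: the question's six entries are placed on six distinct rows so that each key owns exactly one row.

```python
import numpy as np
from scipy import sparse

# Hypothetical data: one key per matrix row (6 rows, 11 columns).
keys = np.array(['foo', 'bar', 'foo', 'baz', 'baz', 'bar'])
rows = [0, 1, 2, 3, 4, 5]
cols = [1, 10, 1, 3, 3, 4]
data = [3, 4, 1, 2, 10, 1]
M = sparse.coo_matrix((data, (rows, cols)), shape=(6, 11)).tocsr()

# Map every key to an integer group id (labels come back sorted).
labels, group_ids = np.unique(keys, return_inverse=True)

# Indicator matrix G with G[g, r] = 1 when row r belongs to group g.
n_rows = M.shape[0]
G = sparse.csr_matrix(
    (np.ones(n_rows, dtype=M.dtype), (group_ids, np.arange(n_rows))),
    shape=(len(labels), n_rows),
)

# Row g of the product is the sum of all rows of M in group g.
grouped = (G @ M).tocsr()

# Unpack each grouped row into {column: sum} for comparison with the target.
result = {}
for g, label in enumerate(labels):
    start, end = grouped.indptr[g], grouped.indptr[g + 1]
    result[str(label)] = dict(
        zip(grouped.indices[start:end].tolist(), grouped.data[start:end].tolist())
    )
print(result)
```

The product stays sparse and avoids a Python loop over the stored entries; the final loop runs only once per group and is there just to show the result in the question's dictionary form.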
