How to select some rows from a sparse matrix and use them to form a new sparse matrix

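I have a very large sparse matrix (100000 columns and 100000 rows). I want to select some rows of this sparse matrix and use them to form a new sparse matrix. I first tried converting the rows to a dense matrix and then converting the result back to a sparse matrix, but Python raised a 'Memory error'. Then I tried another approach: I selected rows of the sparse matrix and put them into a list, but when I tried to convert this list to a sparse matrix, I got: 'ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().' So how can I convert this list of sparse rows into one big sparse matrix?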

# X_train is a 100000x100000 sparse matrix (kept in sparse form)
# y_train is a 1-dimensional array of length 100000
# I try to build a new sparse matrix from some rows of X_train; the
# selection criterion is that the sum of the sparse row equals 0

y_train_new = []
X_train_new = []
for i in range(len(y_train)):
    if np.sum(X_train[i].toarray()[0]) == 0:
        X_train_new.append(X_train[i])
        y_train_new.append(y_train[i])

When I do this:

X_train_new = scipy.sparse.csr_matrix(X_train_new)

I get the error message:

'ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().'

I added some tags that would have helped me see your question sooner.

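When asking about an error, it is a good idea to include some or all of the traceback, so we can see where the error occurs. Information about the inputs to the failing function call also helps.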

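Fortunately, I can reproduce the problem fairly easily, and with a reasonably sized example. There is no need to build a 100000x100000 matrix that no one can look at!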

Make a modestly sized sparse matrix:

In [126]: M = sparse.random(10,10,.1,'csr')                                                              
In [127]: M                                                                                              
Out[127]: 
<10x10 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in Compressed Sparse Row format>

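I can take row sums of the whole matrix, just as with a dense array. The sparse code actually uses matrix-vector multiplication to do this, which produces a dense matrix.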

In [128]: M.sum(axis=1)                                                                                  
Out[128]: 
matrix([[0.59659958],
        [0.80390719],
        [0.37251645],
        [0.        ],
        [0.85766909],
        [0.42267366],
        [0.76794737],
        [0.        ],
        [0.83131054],
        [0.46254634]])

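It is sparse enough that some rows have no nonzero values. With floats, especially in the 0-1 range, I am not going to get rows where nonzero values cancel each other out.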
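As a minimal sketch of how that row-sum result can be turned into something easier to index with: the small matrix below is made-up illustration data (not the random `M` from the session above), and the names `row_sums` and `zero_rows` are my own.

```python
import numpy as np
from scipy import sparse

# Small illustrative matrix (hypothetical data): rows 1 and 3 are empty.
M = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0],
                                [0.0, 0.0, 0.0],
                                [0.0, 3.0, 0.0],
                                [0.0, 0.0, 0.0]]))

# For a sparse matrix, sum(axis=1) gives a dense (n, 1) result;
# flatten it into an ordinary 1-D array.
row_sums = np.asarray(M.sum(axis=1)).ravel()

# Boolean mask marking the rows whose sum is exactly zero.
zero_rows = row_sums == 0
```

Only the n row sums are ever made dense here, so this should stay cheap even when the matrix itself is far too large to convert with `toarray()`.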

Or using your row-by-row calculation:

In [133]: alist = [np.sum(row.toarray()[0]) for row in M]                                                
In [134]: alist                                                                                          
Out[134]: 
[0.5965995802776853,
 0.8039071870427961,
 0.37251644566924424,
 0.0,
 0.8576690924353791,
 0.42267365715276595,
 0.7679473651419432,
 0.0,
 0.8313105376003095,
 0.4625463360625408]

And selecting the rows that sum to zero (in this case, the empty ones):

In [135]: alist = [row for row in M if np.sum(row.toarray()[0])==0]                                      
In [136]: alist                                                                                          
Out[136]: 
[<1x10 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>,
 <1x10 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>]

Note that this is a list of sparse matrices. That is what you got too, right?

Now if I try to make a matrix from that list, I get your error:

In [137]: sparse.csr_matrix(alist)                                                                       
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-137-5e20e6fc2524> in <module>
----> 1 sparse.csr_matrix(alist)

/usr/local/lib/python3.6/dist-packages/scipy/sparse/compressed.py in __init__(self, arg1, shape, dtype, copy)
     86                                  "".format(self.format))
     87             from .coo import coo_matrix
---> 88             self._set_self(self.__class__(coo_matrix(arg1, dtype=dtype)))
     89 
     90         # Read matrix dimensions given, if any

/usr/local/lib/python3.6/dist-packages/scipy/sparse/coo.py in __init__(self, arg1, shape, dtype, copy)
    189                                          (shape, self._shape))
    190 
--> 191                 self.row, self.col = M.nonzero()
    192                 self.data = M[self.row, self.col]
    193                 self.has_canonical_format = True

/usr/local/lib/python3.6/dist-packages/scipy/sparse/base.py in __bool__(self)
    285             return self.nnz != 0
    286         else:
--> 287             raise ValueError("The truth value of an array with more than one "
    288                              "element is ambiguous. Use a.any() or a.all().")
    289     __nonzero__ = __bool__

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

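OK, this error does not tell me a whole lot (at least not without reading more of the code), but it clearly has a problem with the input list. Read the csr_matrix docs again, though! Do they say that we can give it a list of sparse matrices?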

There is, however, a sparse.vstack function that works with a list of matrices (it is modeled on np.vstack):

In [140]: sparse.vstack(alist)                                                                           
Out[140]: 
<2x10 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>
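Applied to the loop from the question, this might look roughly like the sketch below. The tiny `X_train` and `y_train` are stand-in data I made up for illustration, and `rows` and `labels` are hypothetical helper names; the only change from the question's approach is that the list is passed to `sparse.vstack` instead of `csr_matrix`.

```python
import numpy as np
from scipy import sparse

# Stand-in data (hypothetical): rows 1 and 3 of X_train sum to zero.
X_train = sparse.csr_matrix(np.array([[1.0, 0.0],
                                      [0.0, 0.0],
                                      [0.0, 2.0],
                                      [0.0, 0.0]]))
y_train = np.array([10, 11, 12, 13])

rows, labels = [], []
for i in range(X_train.shape[0]):
    row = X_train[i]          # a 1 x n sparse matrix, never densified
    if row.sum() == 0:        # same selection criterion as in the question
        rows.append(row)
        labels.append(y_train[i])

# Stack the list of 1 x n sparse rows into one sparse matrix.
X_train_new = sparse.vstack(rows, format="csr")
y_train_new = np.array(labels)
```

Note that `sparse.vstack` needs at least one block, so an empty `rows` list would have to be handled separately.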

We get a more interesting result if we select the rows that do not sum to zero:

In [141]: alist = [row for row in M if np.sum(row.toarray()[0])!=0]                                      
In [142]: M1=sparse.vstack(alist)                                                                        
In [143]: M1                                                                                             
Out[143]: 
<8x10 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in Compressed Sparse Row format>

But I showed earlier that we can get the row sums without iterating. Applying where to Out[128], I get the row indices (of the nonzero rows):

In [151]: idx=np.where(M.sum(axis=1))                                                                    
In [152]: idx                                                                                            
Out[152]: (array([0, 1, 2, 4, 5, 6, 8, 9]), array([0, 0, 0, 0, 0, 0, 0, 0]))
In [153]: M2=M[idx[0],:]                                                                                 
In [154]: M2                                                                                             
Out[154]: 
<8x10 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in Compressed Sparse Row format>
In [155]: np.allclose(M1.A, M2.A)                                                                        
Out[155]: True
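Putting the non-iterative version together for the original question (rows whose sum is zero, plus the matching labels) could look something like this. It is a sketch under the assumption that `X_train` is in CSR format and `y_train` is a NumPy array; the data is made up, and `row_sums` and `keep` are names I introduced.

```python
import numpy as np
from scipy import sparse

# Stand-in data (hypothetical): rows 1 and 3 of X_train sum to zero.
X_train = sparse.csr_matrix(np.array([[1.0, 0.0, 0.0],
                                      [0.0, 0.0, 0.0],
                                      [0.0, 2.0, 3.0],
                                      [0.0, 0.0, 0.0]]))
y_train = np.array([10, 11, 12, 13])

# Dense 1-D array of row sums; only n values are ever made dense.
row_sums = np.asarray(X_train.sum(axis=1)).ravel()

# Integer indices of the rows that satisfy the selection criterion.
keep = np.flatnonzero(row_sums == 0)

# Index the sparse matrix and the label array with the same indices.
X_train_new = X_train[keep]
y_train_new = y_train[keep]
```

Indexing both objects with the same `keep` array keeps the rows and labels aligned, and the result stays sparse throughout.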

====

I suspect the error in In [137] is produced when trying to find the nonzero (np.where) elements of the input, or of the input converted to a numpy array:

In [159]: alist = [row for row in M if np.sum(row.toarray()[0])==0]                                      
In [160]: np.array(alist)                                                                                
Out[160]: 
array([<1x10 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>,
       <1x10 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>], dtype=object)
In [161]: np.array(alist).nonzero()                                                                      
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-161-832a25987c15> in <module>
----> 1 np.array(alist).nonzero()

/usr/local/lib/python3.6/dist-packages/scipy/sparse/base.py in __bool__(self)
    285             return self.nnz != 0
    286         else:
--> 287             raise ValueError("The truth value of an array with more than one "
    288                              "element is ambiguous. Use a.any() or a.all().")
    289     __nonzero__ = __bool__

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

Calling np.array on a list of sparse matrices produces an object dtype array of those matrices.
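Both tracebacks end in the sparse matrix's `__bool__` method, so the same message can be triggered directly by truth-testing a single row. The following is a small sketch of that; the empty 1x10 matrix and the `raised` and `message` variables are just for illustration.

```python
from scipy import sparse

# An empty 1 x 10 sparse row, similar to the elements of alist above.
row = sparse.csr_matrix((1, 10))

raised = False
message = ""
try:
    bool(row)                 # truth value of a non-1x1 sparse matrix
except ValueError as err:
    raised = True
    message = str(err)
```

This fits the suspicion above: evaluating nonzero() on the object array asks each sparse matrix for a truth value, which it refuses to give unless its shape is 1x1.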