Applying scipy.sparse.linalg.svds throws a MemoryError?

I am trying to decompose a sparse matrix (40,000×1,400,000) with scipy.sparse.linalg.svds on a 64-bit machine with 140 GB of RAM, as follows:

import scipy.sparse.linalg

k = 5000
tfidf_mtx = tfidf_m.tocsr()  # tfidf_m is the sparse matrix described above
u_45, s_45, vT_45 = scipy.sparse.linalg.svds(tfidf_mtx, k=k)

This works for k between 1000 and 4500, but with k = 5000 it throws a MemoryError. The exact error is as follows:

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-6-31a69ce54e2c> in <module>()
      4 k = 4000
      5 tfidf_mtx = tfidf_m.tocsr()
----> 6 get_ipython().magic(u'time u_50,s_50,vT_50 =linalg.svds(tfidf_mtx, k=k))
      7 # print len(s),s
      8 

/usr/lib/python2.7/dist-packages/IPython/core/interactiveshell.pyc in magic(self, arg_s)
   2163         magic_name, _, magic_arg_s = arg_s.partition(' ')
   2164         magic_name = magic_name.lstrip(prefilter.ESC_MAGIC)
-> 2165         return self.run_line_magic(magic_name, magic_arg_s)
   2166 
   2167     #-------------------------------------------------------------------------

/usr/lib/python2.7/dist-packages/IPython/core/interactiveshell.pyc in run_line_magic(self, magic_name, line)
   2084                 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
   2085             with self.builtin_trap:
-> 2086                 result = fn(*args,**kwargs)
   2087             return result
   2088 

/usr/lib/python2.7/dist-packages/IPython/core/magics/execution.pyc in time(self, line, cell, local_ns)

/usr/lib/python2.7/dist-packages/IPython/core/magic.pyc in <lambda>(f, *a, **k)
    189     # but it's overkill for just that one bit of state.
    190     def magic_deco(arg):
--> 191         call = lambda f, *a, **k: f(*a, **k)
    192 
    193         if callable(arg):

/usr/lib/python2.7/dist-packages/IPython/core/magics/execution.pyc in time(self, line, cell, local_ns)
   1043         else:
   1044             st = clock2()
-> 1045             exec code in glob, local_ns
   1046             end = clock2()
   1047             out = None

<timed exec> in <module>()

/usr/local/lib/python2.7/dist-packages/scipy/sparse/linalg/eigen/arpack/arpack.pyc in svds(A, k, ncv, tol, which, v0, maxiter, return_singular_vectors)
   1751         else:
   1752             ularge = eigvec[:, above_cutoff]
-> 1753             vhlarge = _herm(X_matmat(ularge) / slarge)
   1754 
   1755         u = _augmented_orthonormal_cols(ularge, nsmall)

/usr/local/lib/python2.7/dist-packages/scipy/sparse/base.pyc in dot(self, other)
    244 
    245         """
--> 246         return self * other
    247 
    248     def __eq__(self, other):

/usr/local/lib/python2.7/dist-packages/scipy/sparse/base.pyc in __mul__(self, other)
    298                 return self._mul_vector(other.ravel()).reshape(M, 1)
    299             elif other.ndim == 2 and other.shape[0] == N:
--> 300                 return self._mul_multivector(other)
    301 
    302         if isscalarlike(other):

/usr/local/lib/python2.7/dist-packages/scipy/sparse/compressed.pyc in _mul_multivector(self, other)
    463 
    464         result = np.zeros((M,n_vecs), dtype=upcast_char(self.dtype.char,
--> 465                                                         other.dtype.char))
    466 
    467         # csr_matvecs or csc_matvecs

MemoryError: 

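For k = 3000 and k = 4500, the ratio of the sum of squared singular values to the sum of squares of all matrix entries is 0.7033 and 0.8230, respectively. I have searched online for a long time without success. Please help, or suggest some ideas on how to achieve this.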
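For reference, the ratio described above can be computed along these lines. This is only a sketch on a small random stand-in matrix, not the original TF-IDF data; `captured_energy` and `demo` are hypothetical names, and the matrix is assumed to be real-valued. It requests singular values only, which is all the ratio needs.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

def captured_energy(mtx, k):
    """Fraction of the sum of squared entries captured by the top-k singular values."""
    # Only the singular values are needed, so skip the singular vectors.
    s = svds(mtx, k=k, return_singular_vectors=False)
    # Sum of squares of all entries, computed without densifying the matrix.
    total = mtx.multiply(mtx).sum()
    return float(np.sum(s ** 2) / total)

# Small random sparse matrix standing in for the real data (hypothetical).
demo = sp.random(200, 500, density=0.05, format="csr", random_state=0)
print(captured_energy(demo, k=20))
```

The ratio can only grow as k increases, since each additional singular value adds a non-negative term to the numerator.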

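So the array being allocated where the traceback ends (`result` in `_mul_multivector`) is a dense (M, k) array. On a plain old machine: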

In [368]: np.ones((40000,1000))
....
In [369]: np.ones((40000,4000))
...
In [370]: np.ones((40000,5000))
 ...
--> 190     a = empty(shape, dtype, order)
    191     multiarray.copyto(a, 1, casting='unsafe')
    192     return a
MemoryError: 

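It may be just a coincidence that I hit a memory error at the same size as your code does. But if you make the problem big enough, you will hit a memory error at some point.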

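Your stack trace shows that the error occurs when the sparse matrix is multiplied by a dense 2-D array (`other`), and the result of that product is dense as well.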
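To put rough numbers on those dense arrays, here is a back-of-the-envelope sketch. It assumes float64 (8 bytes per element) and uses the shapes from the question; `dense_bytes` is a hypothetical helper, not part of SciPy or NumPy. The right singular vectors alone form a dense k × 1,400,000 array, which at k = 5000 is 1,400,000 × 5,000 × 8 bytes = 56 GB, and any temporary copy made while computing it would need the same amount again.

```python
import numpy as np

def dense_bytes(rows, cols, dtype=np.float64):
    """Bytes needed for a dense rows x cols array of the given dtype."""
    return rows * cols * np.dtype(dtype).itemsize

n_rows, n_cols = 40000, 1400000  # matrix shape from the question

for k in (1000, 4500, 5000):
    u_gb = dense_bytes(n_rows, k) / 1e9   # left singular vectors, n_rows x k
    vt_gb = dense_bytes(k, n_cols) / 1e9  # right singular vectors, k x n_cols
    print("k=%d: u ~ %.1f GB, vT ~ %.1f GB" % (k, u_gb, vt_gb))
```

Passing `dtype=np.float32` to the helper shows that single precision would halve these figures, if the reduced accuracy is acceptable for the application.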