Efficient way to write into a list-of-lists (lil) sparse matrix built from another matrix

Question

I'm trying to fill parts of one sparse lil matrix with another. The matrices are set up like this:

adj_mat = sp.dok_matrix((self.n_users + self.m_items, self.n_users + self.m_items), dtype=np.float32)
adj_mat = adj_mat.tolil()
R = self.UserItemNet.tolil()

When I try to fill it with this code:

adj_mat[:self.n_users, self.n_users:] = R
adj_mat[self.n_users:, :self.n_users] = R.T

my process is killed for exceeding the available RAM (240 GiB). My dataset is large:

adj_mat:

<1374194x1374194 sparse matrix of type '<class 'numpy.float32'>'
with 0 stored elements in List of Lists format>

R:

<940696x433498 sparse matrix of type '<class 'numpy.float64'>'
with 24053124 stored elements in List of Lists format>

self.n_users = 940696

Is there a more efficient way to fill a list-of-lists matrix like this?

Regards

Answer 1

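Here's the bmat approach to building the composite matrix (assuming I've deduced the layout correctly):

Make a matrix. bmat combines the coo attributes of its blocks, so let's start with that format: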

In [389]: R = sparse.coo_matrix([[0,1],[2,0],[0,0],[3,4]])
In [390]: R
Out[390]: 
<4x2 sparse matrix of type '<class 'numpy.int64'>'
    with 4 stored elements in COOrdinate format>
In [391]: R.toarray()
Out[391]: 
array([[0, 1],
       [2, 0],
       [0, 0],
       [3, 4]])

And define the 'blank' filler matrices:

In [392]: Z1 = sparse.coo_matrix((4,4),dtype=int)
In [393]: Z2 = sparse.coo_matrix((2,2),dtype=int)

Now join them:

In [394]: M = sparse.bmat([[Z1,R],[R.T,Z2]])
In [395]: M
Out[395]: 
<6x6 sparse matrix of type '<class 'numpy.int64'>'
    with 8 stored elements in COOrdinate format>
In [396]: M.toarray()
Out[396]: 
array([[0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 2, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 3, 4],
       [0, 2, 0, 3, 0, 0],
       [1, 0, 0, 4, 0, 0]])

This avoids the densification that the default slice assignment is apparently doing.

block_diag fills the other diagonal:
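Applied to the question's layout, the same idea might look like the sketch below. The tiny R is a stand-in for self.UserItemNet, and the names n_users and m_items are taken from the question. Passing None for the zero blocks (instead of explicit empty matrices) and requesting format="csr" and dtype=np.float32 are choices made here for illustration, not something the question specifies.

```python
import numpy as np
from scipy import sparse

# Small stand-in for the question's UserItemNet (users x items).
R = sparse.coo_matrix([[0, 1], [2, 0], [0, 0], [3, 4]], dtype=np.float64)
n_users, m_items = R.shape

# None marks an all-zero block; its shape is inferred from the
# other blocks in the same block row and block column.
adj_mat = sparse.bmat([[None, R], [R.T, None]],
                      format="csr", dtype=np.float32)

print(adj_mat.shape)  # (n_users + m_items, n_users + m_items)
print(adj_mat.nnz)    # twice the number of stored elements in R
```

Only the stored elements of R are touched here, so memory use should scale with R.nnz rather than with the full shape of adj_mat.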

In [398]: sparse.block_diag([R,R.T])
Out[398]: 
<6x6 sparse matrix of type '<class 'numpy.int64'>'
    with 8 stored elements in COOrdinate format>
In [399]: _.toarray()
Out[399]: 
array([[0, 1, 0, 0, 0, 0],
       [2, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [3, 4, 0, 0, 0, 0],
       [0, 0, 0, 2, 0, 3],
       [0, 0, 1, 0, 0, 4]])

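If you want to write your own version, the block_diag code would be a good model. The v1.6 release notes claim it is more efficient than the previous version (which, I believe, worked through bmat).

Assignment efficiency

In response to @CJR's comment about the memory inefficiency of lil, I looked at some alternatives.

Make a large coo matrix: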
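As a rough idea of what such a hand-written version could look like for this particular layout, the sketch below concatenates the coo row, col and data arrays of R with the appropriate offsets. The helper name bipartite_adjacency is hypothetical, and this is not the actual block_diag implementation, only the same general technique.

```python
import numpy as np
from scipy import sparse

def bipartite_adjacency(R, dtype=np.float32):
    """Hypothetical helper: build [[0, R], [R.T, 0]] from R's coo triplets."""
    R = sparse.coo_matrix(R)
    n_users, m_items = R.shape
    n = n_users + m_items
    # Upper-right block: R with its columns shifted by n_users.
    # Lower-left block: R.T, i.e. rows and columns swapped,
    # with the rows shifted by n_users.
    row = np.concatenate([R.row, R.col + n_users])
    col = np.concatenate([R.col + n_users, R.row])
    data = np.concatenate([R.data, R.data]).astype(dtype, copy=False)
    return sparse.coo_matrix((data, (row, col)), shape=(n, n))

R = sparse.coo_matrix([[0, 1], [2, 0], [0, 0], [3, 4]])
A = bipartite_adjacency(R)
print(A.toarray())
```

The result is in coo format; call A.tocsr() before doing arithmetic or row slicing on it.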

In [10]: M=sparse.random(10000,10000, .2, 'coo')

Converting to lil is slower than converting to csr:

In [11]: timeit M.tocsr()
1.43 s ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [12]: timeit M.tolil()
3.69 s ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So how does block assignment compare? (Using a much smaller block than the OP's.)

In [13]: Ml=M.tolil(); Mr=M.tocsr()

In [14]: timeit Ml[:100,:100]=np.eye(100)
1.07 ms ± 341 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [15]: timeit Mr[:100,:100]=np.eye(100)
/usr/local/lib/python3.8/dist-packages/scipy/sparse/_index.py:125: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  self._set_arrayXarray(i, j, x)
14.1 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

csr assignment is slower (14.1 ms versus 1.07 ms for lil), and coo assignment does not work at all:

In [16]: timeit M[:100,:100]=np.eye(100)
Traceback (most recent call last):
  ....
TypeError: 'coo_matrix' object does not support item assignment

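So if you must assign by blocks, lil is a good choice, provided the blocks aren't too large. But building the matrix directly from blocks with bmat is better. As the lil docs say, use coo when constructing large matrices.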
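Before switching a large job from slice assignment to bmat, it may be worth confirming on a small example that both routes give the same matrix. The sketch below does that with a made-up 4x2 R; the variable names A_lil and A_bmat are illustrative only.

```python
import numpy as np
from scipy import sparse

# Small made-up R in lil format, standing in for the question's data.
R = sparse.lil_matrix(np.array([[0., 1.], [2., 0.], [0., 0.], [3., 4.]]))
n_users, m_items = R.shape
n = n_users + m_items

# Route 1: the question's slice assignment into an empty lil matrix.
A_lil = sparse.lil_matrix((n, n), dtype=np.float64)
A_lil[:n_users, n_users:] = R
A_lil[n_users:, :n_users] = R.T

# Route 2: assemble the same layout from blocks.
A_bmat = sparse.bmat([[None, R], [R.T, None]], format="csr")

# Dense comparison is fine at this size; do not do this on the real data.
same = np.array_equal(A_lil.toarray(), A_bmat.toarray())
print(same)
```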

