Which format of scipy.sparse is best for this type of matrix generation and use?

Question

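I have a data file that encodes the non-zero elements of a large sparse Boolean matrix. The matrix has no particular structure, i.e. it is not diagonal, block, etc. Each line of the file specifies one element. Right now I use the following loop to populate the matrix: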

import numpy as np
from scipy.sparse import dok_matrix

nRows = 30000
nCols = 600000

data = dok_matrix((nRows, nCols), dtype=np.int8)

with open('input.txt', 'r') as fraw:
    for line in fraw:
        # Figure out iRow and iCol to set to 1 from line
        data[iRow, iCol] = 1

This works, but it is very slow. Is there a different type of scipy.sparse matrix that is more optimal?

By 'optimal' I mean the speed of generating the matrix and of accessing blocks of its rows and columns, e.g. vector operations like

someRows = data[rowIndex1:rowIndex2, :]
someColumns = data[:, colIndex1:colIndex2]

Does the answer change if memory is more important than speed?

Thanks

Answer 1

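For incremental additions like this, dok is about as good as it gets. It is really a dictionary that stores each value under a tuple key: (iRow, iCol). So storing and fetching depend on basic Python dictionary efficiency.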

The only other format that is good for incremental additions is lil, which stores the data as 2 lists of lists.
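To make the "2 lists of lists" layout concrete, here is a minimal sketch on a tiny matrix. The shape and the entries are made up for illustration; `rows` and `data` are the attributes where lil_matrix keeps the per-row column indices and the per-row values.

```python
import numpy as np
from scipy.sparse import lil_matrix

# Tiny illustrative matrix; shape and entries are arbitrary
m = lil_matrix((3, 5), dtype=np.int8)
m[0, 3] = 1
m[0, 1] = 1
m[2, 4] = 1

# m.rows holds, for each row, the list of column indices of its non-zeros
row_lists = m.rows.tolist()
# m.data holds, for each row, the matching list of values
data_lists = [[int(v) for v in row] for row in m.data]

print(row_lists)   # [[1, 3], [], [4]]
print(data_lists)  # [[1, 1], [], [1]]
```

Each assignment only touches the lists of one row, which is why lil copes reasonably well with element-by-element filling.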

Another approach is to collect your data in 3 lists and construct the matrix at the end. Start with coo and its (data, (i, j)) style of input.
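A minimal sketch of the collect-then-build approach follows. The `pairs` list is a hypothetical stand-in for whatever parsing each line of input.txt needs, and the conversion to csr/csc at the end is my assumption about how the row and column blocks from the question would then be taken, since coo itself does not support slicing.

```python
import numpy as np
from scipy.sparse import coo_matrix

n_rows, n_cols = 30000, 600000

# Hypothetical parsed (row, col) pairs standing in for the lines of input.txt
pairs = [(0, 5), (2, 7), (2, 599999), (29999, 0)]

# Collect the indices in plain Python lists while reading
rows, cols = [], []
for i_row, i_col in pairs:
    rows.append(i_row)
    cols.append(i_col)

# Every stored element is 1; build the matrix in one call at the end.
# Note: duplicate (row, col) pairs are summed when coo is converted.
vals = np.ones(len(rows), dtype=np.int8)
data = coo_matrix((vals, (rows, cols)), shape=(n_rows, n_cols))

# Convert once: csr for row blocks, csc for column blocks
data_csr = data.tocsr()
data_csc = data.tocsc()
some_rows = data_csr[0:3, :]
some_cols = data_csc[:, 0:10]
print(some_rows.shape, some_cols.shape)
```

Keeping both a csr and a csc copy costs extra memory, so if memory matters more than speed it may be preferable to keep only the one whose slicing direction is used most.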

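Dense numpy arrays are loaded from files with genfromtxt or loadtxt. Both read the file line by line, collect the values in a list of lists, and create the array at the end.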

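What is the speed like if you just read the file and parse the values, without saving anything to the dok? That would give you an idea of how much time is actually spent adding data to the matrix.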
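One way to run that comparison is sketched below. The "row col" line format, the `parse` helper, and the in-memory text used in place of input.txt are all assumptions made for illustration; the real parsing step depends on the actual file.

```python
import io
import time
import numpy as np
from scipy.sparse import dok_matrix

# Hypothetical input: one "row col" pair per line, generated in memory
rng = np.random.default_rng(0)
text = "".join(
    f"{r} {c}\n"
    for r, c in zip(rng.integers(0, 1000, 5000), rng.integers(0, 1000, 5000))
)

def parse(line):
    """Hypothetical parser: split a 'row col' line into two ints."""
    r, c = line.split()
    return int(r), int(c)

# Pass 1: read and parse only, store nothing
t0 = time.perf_counter()
for line in io.StringIO(text):
    parse(line)
parse_only = time.perf_counter() - t0

# Pass 2: read, parse, and assign into the dok
m = dok_matrix((1000, 1000), dtype=np.int8)
t0 = time.perf_counter()
for line in io.StringIO(text):
    i, j = parse(line)
    m[i, j] = 1
parse_and_store = time.perf_counter() - t0

print(f"parse only:      {parse_only:.4f} s")
print(f"parse and store: {parse_and_store:.4f} s")
```

The difference between the two timings approximates the cost of the dok assignments alone.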

Another possibility is to store the values directly in a plain dictionary and use that to create the dok.

In [60]: adict=dict()

In [61]: for i in np.random.randint(1000,size=(2000,)):
   ....:     adict[(i,i)]=1
   ....:

In [62]: dd=sparse.dok_matrix((1000,1000),dtype=np.int8)

In [63]: dd.update(adict)

In [64]: dd.A
Out[64]: 
array([[1, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 1]], dtype=int8)

This is a lot faster than updating the dok directly.

In [66]: %%timeit 
for i in np.random.randint(1000,size=(2000,)):
    adict[(i,i)]=1
dd.update(adict)
   ....: 
1000 loops, best of 3: 1.32 ms per loop

In [67]: %%timeit 
for i in np.random.randint(1000,size=(2000,)):
    dd[i,i]=1
   ....: 
10 loops, best of 3: 35.6 ms per loop

There must be some overhead in updating the dok that I have not accounted for.

I just realized that I suggested this update method once before:

Why are lil_matrix and dok_matrix so slow compared to common dict of dicts?

python

vectorization

scipy

sparse-matrix