How can I convert a sparse representation in a .txt file to a dense matrix in scipy?

Question

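I have a .txt file from the epinion dataset which is a sparse representation (i.e. 23 387 5 represents the fact "user 23 has rated item 387 as 5"). I want to convert this sparse format to its dense representation in scipy so that I can do matrix factorization on it.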

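I have loaded the file with loadtxt() from numpy and it is a [664824, 3] array. I pass it to scipy.sparse.csr_matrix and then call todense(), hoping to get the dense format, but I always get the same matrix back: [664824, 3]. How can I turn it into the original [40163,139738] dense representation?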

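import numpy as np
from scipy.sparse import csr_matrix

d = np.loadtxt("MFCode/Epinions_dataset.txt")  # shape (664824, 3)
S = csr_matrix(d)  # treats d itself as a 664824x3 matrix, not as (row, col, value) triples
D = S.todense()    # so the result is still (664824, 3)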

I expect to get a dense matrix with the shape [40163,139738].

Answer 1

A small sample of csv-like text:

In [219]: txt = """0 1 3 
     ...: 1 0 4 
     ...: 2 2 5 
     ...: 0 3 6""".splitlines()                                                 
In [220]: data = np.loadtxt(txt)                                                
In [221]: data                                                                  
Out[221]: 
array([[0., 1., 3.],
       [1., 0., 4.],
       [2., 2., 5.],
       [0., 3., 6.]])

Using sparse, with the (data, (row, col)) style of input:

In [222]: from scipy import sparse                                              
In [223]: M = sparse.coo_matrix((data[:,2], (data[:,0], data[:,1])), shape=(5,4))                                                                     
In [224]: M                                                                     
Out[224]: 
<5x4 sparse matrix of type '<class 'numpy.float64'>'
    with 4 stored elements in COOrdinate format>
In [225]: M.A                                                                   
Out[225]: 
array([[0., 3., 0., 6.],
       [4., 0., 0., 0.],
       [0., 0., 5., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
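
For the real file, the same construction might look like the sketch below. The helper name triples_to_sparse is hypothetical (it is not part of scipy), and the sketch assumes the user and item ids in the file are non-negative integers that can be used directly as row and column indices. The shape is inferred from the largest ids unless it is passed explicitly.

```python
import numpy as np
from scipy import sparse


def triples_to_sparse(triples, shape=None):
    """Build a CSR matrix from an (n, 3) array of (row, col, value) triples.

    Hypothetical helper, not part of scipy. Assumes the ids are
    non-negative integers usable directly as indices.
    """
    triples = np.asarray(triples)
    # loadtxt returns floats, so cast the index columns to integers
    rows = triples[:, 0].astype(np.int64)
    cols = triples[:, 1].astype(np.int64)
    vals = triples[:, 2]
    if shape is None:
        # Smallest shape that can hold every index
        shape = (int(rows.max()) + 1, int(cols.max()) + 1)
    return sparse.coo_matrix((vals, (rows, cols)), shape=shape).tocsr()


# Same small sample as above; for the real data one would pass
# np.loadtxt("MFCode/Epinions_dataset.txt") instead.
data = np.array([[0, 1, 3], [1, 0, 4], [2, 2, 5], [0, 3, 6]], dtype=float)
M = triples_to_sparse(data, shape=(5, 4))
print(M.shape, M.nnz)
```

If the ids in the file start at 1, row 0 and column 0 simply stay empty with this sketch; subtracting 1 from both index columns first would avoid that.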

Or fill a zeros array directly:

In [226]: arr = np.zeros((5,4))                                                 
In [227]: arr[data[:,0].astype(int), data[:,1].astype(int)]=data[:,2]           
In [228]: arr                                                                   
Out[228]: 
array([[0., 3., 0., 6.],
       [4., 0., 0., 0.],
       [0., 0., 5., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
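
One difference between the two approaches shows up if the file contains the same (row, col) pair more than once. A small sketch with made-up data: coo_matrix sums duplicate entries when it is converted, while index assignment keeps only one of the values.

```python
import numpy as np
from scipy import sparse

# Made-up triples in which position (0, 1) appears twice
rows = np.array([0, 0, 1])
cols = np.array([1, 1, 0])
vals = np.array([3.0, 4.0, 5.0])

# COO format sums duplicate entries on conversion
summed = sparse.coo_matrix((vals, (rows, cols)), shape=(2, 2)).toarray()

# Index assignment overwrites, so only one of the two values survives
assigned = np.zeros((2, 2))
assigned[rows, cols] = vals

print(summed[0, 1])    # 7.0
print(assigned[0, 1])  # one of 3.0 or 4.0
```

Which behaviour is wanted depends on the data; for a ratings file with no repeated pairs the two results are identical.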

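But note that np.zeros([40163,139738]) could raise a memory error, since a float64 array of that shape needs roughly 45 GB. M.A (that is, M.toarray()) could raise the same error.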
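
A rough estimate shows how large the dense array would be, and why it may be worth staying in sparse form. The sketch below computes the float64 storage for the full shape, then runs a truncated SVD with scipy.sparse.linalg.svds directly on a small sparse matrix. Using svds as the factorization is an assumption here; the question does not say which method is intended.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# Dense float64 storage for the shape given in the question
n_rows, n_cols = 40163, 139738
dense_bytes = n_rows * n_cols * np.dtype(np.float64).itemsize
print(dense_bytes / 1e9)  # about 44.9 (GB)

# Small sparse example with the same values as the sample above
rows = np.array([0, 1, 2, 0])
cols = np.array([1, 0, 2, 3])
vals = np.array([3.0, 4.0, 5.0, 6.0])
M = sparse.coo_matrix((vals, (rows, cols)), shape=(5, 4)).tocsr()

# Truncated SVD computed on the sparse matrix, without densifying it;
# k must be smaller than min(M.shape)
u, s, vt = svds(M, k=2)
approx = u @ np.diag(s) @ vt  # rank-2 approximation of M
print(u.shape, s.shape, vt.shape)
```

Only the factors u, s and vt are dense here, and their size grows with k rather than with the full 40163 x 139738 shape.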

numpy

scipy

sparse-matrix