How to coerce a data.frame into a sparse matrix in R

I am trying to follow the example here: cui2vecWorkflow, by creating a matrix similar to the one in term_cooccurrence_matrix.rda, which has the following properties:

> cooc<-get(load('~/development/cui2vec/vignettes/term_cooccurrence_matrix.rda'))
> str(cooc)
Formal class 'dsCMatrix' [package "Matrix"] with 7 slots
  ..@ i       : int [1:2366] 0 1 2 0 1 2 3 4 3 5 ...
  ..@ p       : int [1:101] 0 1 2 3 7 8 10 17 19 27 ...
  ..@ Dim     : int [1:2] 100 100
  ..@ Dimnames:List of 2
  .. ..$ : chr [1:100] "C0016875" "C0162770" "C0024730" "C0038689" ...
  .. ..$ : chr [1:100] "C0016875" "C0162770" "C0024730" "C0038689" ...
  ..@ x       : num [1:2366] 412 6286 8280 118 110 ...
  ..@ uplo    : chr "U"
  ..@ factors : list()

My data frame looks like this:

> test
        CUI1     CUI2 Count
1   C0000699 C3894683     2
2   C0000699 C0101725     1
3   C0000699 C1882413     3
4   C0000699 C0245531     3
5   C0000699 C0068475     2
6   C0000699 C0538927     3
7   C0000699 C0724693     1
8   C0000699 C0216784     2
9   C0000699 C2248020     1
10  C0000699 C0069449     3
...

But when I read it in and convert it to a matrix, it clearly does not have the same structure:

> mat <- as.matrix(test)
> str(mat)
 chr [1:1000000, 1:3] "C0000699" "C0000699" "C0000699" "C0000699" "C0000699" "C0000699" "C0000699" "C0000699" ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:3] "CUI1" "CUI2" "Count" 

I then take the next step and coerce the matrix mat to a sparse matrix:

> mat <- as(mat,  "sparseMatrix")
> str(mat)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:3000000] 0 1 2 3 4 5 6 7 8 9 ...
  ..@ p       : int [1:4] 0 1000000 2000000 3000000
  ..@ Dim     : int [1:2] 1000000 3
  ..@ Dimnames:List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:3] "CUI1" "CUI2" "Count"
  ..@ x       : num [1:3000000] NA NA NA NA NA NA NA NA NA NA ...
  ..@ factors : list()

But the structure looks wrong.

Trying this, I get an error:

> mat <- as(mat,  "dsCMatrix")
Error in asMethod(object) : 
  not a symmetric matrix; consider forceSymmetric() or symmpart()
In addition: Warning message:
In storage.mode(from) <- "double" : NAs introduced by coercion

So I tried this:

> mat <- as(forceSymmetric(mat),  "dsCMatrix")
Error in forceSymmetric(mat) : 
  invalid class 'NA' to dup_mMatrix_as_geMatrix

(I have not found any examples of how to construct a matrix of class structure("dsCMatrix", package = "Matrix") from a data.frame, so I am just improvising.)

The Dim, Dimnames, and x values do not appear to be defined correctly.

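First coerce the CUI* columns to factors with the same levels, then use xtabs to create a sparse matrix, and finally make it symmetric; the code below does this with forceSymmetric().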

txt <- '
        CUI1     CUI2 Count
1   C0000699 C3894683     2
2   C0000699 C0101725     1
3   C0000699 C1882413     3
4   C0000699 C0245531     3
5   C0000699 C0068475     2
6   C0000699 C0538927     3
7   C0000699 C0724693     1
8   C0000699 C0216784     2
9   C0000699 C2248020     1
10  C0000699 C0069449     3
'
test <- read.table(textConnection(txt), header = TRUE)

library(Matrix)

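# Use one shared set of levels so rows and columns cover the same CUIs
levls <- Reduce(union, test[1:2])
test[1:2] <- lapply(test[1:2], factor, levels = levls)
# Cross-tabulate Count by CUI1 and CUI2 into a sparse matrix
res <- xtabs(Count ~ CUI1 + CUI2, data = test, sparse = TRUE)
# Treat the result as symmetric (forceSymmetric() uses the upper triangle by default)
res <- forceSymmetric(res)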
class(res)
#> [1] "dsCMatrix"
#> attr(,"package")
#> [1] "Matrix"

Created on 2022-02-13 by the reprex package (v2.0.1)
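
As a possible alternative to the xtabs route, the same kind of matrix could be built directly from (row, column, value) triplets with Matrix::sparseMatrix(). The following is only a minimal sketch: the three-row data frame is hypothetical toy data in the same shape as the question's, and the name res2 is made up for illustration. It assumes each unordered CUI pair should land in a single cell; note that sparseMatrix() sums repeated (i, j) pairs by default, so a pair listed in both orders would have its counts added together.

```r
library(Matrix)

# Hypothetical toy data with the same columns as the question's data frame
test <- data.frame(
  CUI1  = c("C0000699", "C0000699", "C0101725"),
  CUI2  = c("C3894683", "C0101725", "C3894683"),
  Count = c(2, 1, 3),
  stringsAsFactors = FALSE
)

# Shared vocabulary so rows and columns use the same index space
levls <- union(test$CUI1, test$CUI2)
i <- match(test$CUI1, levls)
j <- match(test$CUI2, levls)

# Place every pair in the upper triangle (row index <= column index),
# because symmetric = TRUE expects entries from one triangle only
res2 <- sparseMatrix(
  i = pmin(i, j), j = pmax(i, j), x = test$Count,
  dims = c(length(levls), length(levls)),
  dimnames = list(levls, levls),
  symmetric = TRUE
)
class(res2)
```

Swapping the indices with pmin() and pmax() means no pair is dropped for sitting in the lower triangle, which is one difference from relying on forceSymmetric() alone.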