How to coerce a data.frame into a sparse matrix in R
I am trying to follow the example here: cui2vecWorkflow by creating a matrix similar to the one here term_cooccurrence_matrix.rda, which has the following properties:
> cooc<-get(load('~/development/cui2vec/vignettes/term_cooccurrence_matrix.rda'))
> str(cooc)
Formal class 'dsCMatrix' [package "Matrix"] with 7 slots
..@ i : int [1:2366] 0 1 2 0 1 2 3 4 3 5 ...
..@ p : int [1:101] 0 1 2 3 7 8 10 17 19 27 ...
..@ Dim : int [1:2] 100 100
..@ Dimnames:List of 2
.. ..$ : chr [1:100] "C0016875" "C0162770" "C0024730" "C0038689" ...
.. ..$ : chr [1:100] "C0016875" "C0162770" "C0024730" "C0038689" ...
..@ x : num [1:2366] 412 6286 8280 118 110 ...
..@ uplo : chr "U"
..@ factors : list()
My data frame looks like this:
> test
CUI1 CUI2 Count
1 C0000699 C3894683 2
2 C0000699 C0101725 1
3 C0000699 C1882413 3
4 C0000699 C0245531 3
5 C0000699 C0068475 2
6 C0000699 C0538927 3
7 C0000699 C0724693 1
8 C0000699 C0216784 2
9 C0000699 C2248020 1
10 C0000699 C0069449 3
...
But when I read it in and convert it to a matrix, it clearly does not have the same structure:
> mat <- as.matrix(test)
> str(mat)
chr [1:1000000, 1:3] "C0000699" "C0000699" "C0000699" "C0000699" "C0000699" "C0000699" "C0000699" "C0000699" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:3] "CUI1" "CUI2" "Count"
Then I take the next step and coerce the matrix mat to a sparse matrix:
> mat <- as(mat, "sparseMatrix")
> str(mat)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
..@ i : int [1:3000000] 0 1 2 3 4 5 6 7 8 9 ...
..@ p : int [1:4] 0 1000000 2000000 3000000
..@ Dim : int [1:2] 1000000 3
..@ Dimnames:List of 2
.. ..$ : NULL
.. ..$ : chr [1:3] "CUI1" "CUI2" "Count"
..@ x : num [1:3000000] NA NA NA NA NA NA NA NA NA NA ...
..@ factors : list()
But the structure doesn't look right.
Trying this, I get an error:
> mat <- as(mat, "dsCMatrix")
Error in asMethod(object) :
not a symmetric matrix; consider forceSymmetric() or symmpart()
In addition: Warning message:
In storage.mode(from) <- "double" : NAs introduced by coercion
So I try this:
> mat <- as(forceSymmetric(mat), "dsCMatrix")
Error in forceSymmetric(mat) :
invalid class 'NA' to dup_mMatrix_as_geMatrix
(I haven't found any examples of how to construct a matrix of class structure("dsCMatrix", package = "Matrix") from a data.frame, so I'm just improvising.)
The values of Dim, Dimnames and x don't seem to be defined correctly.
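The NA values in the x slot are consistent with how as.matrix() treats a data frame that mixes character and numeric columns: the whole matrix becomes character, and the later conversion to double cannot parse the CUI labels. A minimal sketch of that effect, assuming a made-up two-row data frame df in the same shape as test:

```r
# Hypothetical two-row data frame in the same shape as `test`
# (the labels and counts are made up for illustration).
df <- data.frame(CUI1 = c("A", "A"), CUI2 = c("B", "C"), Count = c(2, 1))

# A matrix holds a single type, so the character columns win.
m <- as.matrix(df)
typeof(m)

# Converting the labels to numbers yields NA (normally with a warning),
# which is what ends up in the sparse matrix.
label_as_number <- suppressWarnings(as.numeric(m[, "CUI1"]))
label_as_number
```

In other words, the three columns need to be interpreted as (row label, column label, value) triplets rather than as matrix columns.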
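First coerce the CUI* columns to factors that share the same levels, then create a sparse matrix with xtabs and make it symmetric. The code below does the last step with forceSymmetric(), which keeps the upper triangle and mirrors it; adding the transpose is the alternative when a pair can appear in either order.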
txt <- '
CUI1 CUI2 Count
1 C0000699 C3894683 2
2 C0000699 C0101725 1
3 C0000699 C1882413 3
4 C0000699 C0245531 3
5 C0000699 C0068475 2
6 C0000699 C0538927 3
7 C0000699 C0724693 1
8 C0000699 C0216784 2
9 C0000699 C2248020 1
10 C0000699 C0069449 3
'
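# Parse the inline table into a data frame
test <- read.table(textConnection(txt), header = TRUE)
library(Matrix)
# Every CUI seen in either column, so both factors share one level set
levls <- Reduce(union, test[1:2])
test[1:2] <- lapply(test[1:2], factor, levels = levls)
# Square sparse table of counts (rows = CUI1, columns = CUI2)
res <- xtabs(Count ~ CUI1 + CUI2, data = test, sparse = TRUE)
# Store as symmetric, taking the values from the upper triangle
res <- forceSymmetric(res)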
class(res)
#> [1] "dsCMatrix"
#> attr(,"package")
#> [1] "Matrix"
Created on 2022-02-13 by the reprex package (v2.0.1)
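If some pairs are stored with the two CUIs in the opposite order, forceSymmetric() on its own would drop whatever lands below the diagonal. A minimal sketch of the add-the-transpose variant, built directly from the triplets with sparseMatrix(); the data frame pairs and its values are made up for illustration and are not taken from the question's data:

```r
library(Matrix)

# Hypothetical toy data in the same shape as `test`.
# The third row lands below the diagonal once labels are ordered A, B, C.
pairs <- data.frame(
  CUI1  = c("A", "A", "C"),
  CUI2  = c("B", "C", "B"),
  Count = c(2, 1, 3)
)

# One shared label set so rows and columns use the same ordering.
labs <- union(pairs$CUI1, pairs$CUI2)
i <- match(pairs$CUI1, labs)
j <- match(pairs$CUI2, labs)

# General (non-symmetric) sparse matrix from the triplets.
m <- sparseMatrix(i = i, j = j, x = pairs$Count,
                  dims = c(length(labs), length(labs)),
                  dimnames = list(labs, labs))

# Mirror every pair across the diagonal, then store in symmetric form.
sym <- forceSymmetric(m + t(m))

# For comparison: keeping only the upper triangle loses the C-B count.
upper_only <- forceSymmetric(m)
```

Note that m + t(m) doubles any diagonal entries and sums the counts when both (a, b) and (b, a) occur, so this only matches the intended co-occurrence counts if each unordered pair appears once and there are no self-pairs.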