Convert DocumentTermMatrix to dgTMatrix
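I am trying to run the AssociatedPress dataset from the topicmodels package through text2vec's LDA implementation.
The problem I am facing is an incompatibility of data types: AssociatedPress is a tm::DocumentTermMatrix, which in turn is a subclass of slam::simple_triplet_matrix. text2vec, however, expects the input x to the fit_transform(x = ...) method of its LDA class to be a Matrix::dgTMatrix.
My question is therefore: is there a way to coerce a DocumentTermMatrix into something text2vec accepts?
Minimal (failing) example: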
library('tm')
library('text2vec')
data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress[1:10, ]
lda_model = LDA$new(
  n_topics = 10,
  doc_topic_prior = 0.1,
  topic_word_prior = 0.01
)
doc_topic_distr =
  lda_model$fit_transform(
    x = dtm,
    n_iter = 1000,
    convergence_tol = 0.001,
    n_check_convergence = 25,
    progressbar = FALSE
  )
...gives:
base::rowSums(x, na.rm = na.rm, dims = dims, ...) : 'x' must be an array of at least two dimensions
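A likely reading of this message, offered as an assumption rather than something confirmed in the thread: a simple_triplet_matrix is stored as a list, and base::rowSums() only accepts arrays (or data frames) with at least two dimensions. The sketch below reproduces the same message with a plain list; `stm_like` is a hypothetical stand-in, not a real slam object.

```r
# Hypothetical stand-in for a simple_triplet_matrix: a plain list holding
# row indices (i), column indices (j), values (v) and the dimensions.
stm_like <- list(
  i = c(1L, 2L, 2L),
  j = c(1L, 1L, 3L),
  v = c(2, 1, 4),
  nrow = 2L,
  ncol = 3L
)

# base::rowSums() checks that its input is an array with at least two
# dimensions, so a list-based object is rejected before anything is summed.
msg <- tryCatch(
  {
    base::rowSums(stm_like)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```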
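The answer is in the duplicate linked by @Dmitriy Selivanov, but it does not mention that the conversion function comes from the Matrix package, which ships with R as a recommended package.
Since I don't have topicmodels installed, I will use the crude dataset included in the tm package. The principle is the same.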
library(tm)
data("crude")
dtm <- DocumentTermMatrix(crude,
                          control = list(weighting =
                                           function(x)
                                             weightTfIdf(x, normalize = FALSE),
                                         stopwords = TRUE))
# convert to a sparse matrix (sparseMatrix() returns a dgCMatrix by default)
m <- Matrix::sparseMatrix(i = dtm$i,
                          j = dtm$j,
                          x = dtm$v,
                          dims = c(dtm$nrow, dtm$ncol),
                          dimnames = dtm$dimnames)
str(m)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:1890] 6 1 18 6 6 5 9 12 9 5 ...
  ..@ p       : int [1:1201] 0 1 2 3 4 5 6 8 9 11 ...
  ..@ Dim     : int [1:2] 20 1200
  ..@ Dimnames:List of 2
  .. ..$ Docs : chr [1:20] "127" "144" "191" "194" ...
  .. ..$ Terms: chr [1:1200] "\"(it)" "\"demand" "\"expansion" "\"for" ...
  ..@ x       : num [1:1890] 4.32 4.32 4.32 4.32 4.32 ...
  ..@ factors : list()
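The str() output shows a dgCMatrix (compressed column form), whereas the question asks for a dgTMatrix (triplet form). If the triplet class is specifically needed, one option is to coerce the result with as(..., "TsparseMatrix"). The sketch below uses only the Matrix package; `trip` is a hypothetical list that mimics the i, j, v, nrow, ncol and dimnames fields read from the DocumentTermMatrix above, so it runs without tm installed.

```r
library(Matrix)

# Hypothetical triplet data laid out like the fields used above
# (i, j, v, nrow, ncol, dimnames); not a real DocumentTermMatrix.
trip <- list(
  i = c(1L, 1L, 2L, 3L),
  j = c(1L, 3L, 2L, 3L),
  v = c(2, 1, 5, 4),
  nrow = 3L,
  ncol = 3L,
  dimnames = list(Docs = c("d1", "d2", "d3"), Terms = c("a", "b", "c"))
)

# Same call pattern as above: builds a column-compressed dgCMatrix.
m_c <- sparseMatrix(i = trip$i,
                    j = trip$j,
                    x = trip$v,
                    dims = c(trip$nrow, trip$ncol),
                    dimnames = trip$dimnames)

# Coerce to triplet representation; the values and dimnames are unchanged.
m_t <- as(m_c, "TsparseMatrix")

print(class(m_c))
print(class(m_t))
```

In the run shown in this answer the dgCMatrix was passed to fit_transform() directly, so the extra coercion is only worth doing if another function insists on the triplet class.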
The rest of your code:
library(text2vec)
lda_model <- LDA$new(
  n_topics = 10,
  doc_topic_prior = 0.1,
  topic_word_prior = 0.01
)
doc_topic_distr <-
  lda_model$fit_transform(
    x = m,
    n_iter = 1000,
    convergence_tol = 0.001,
    n_check_convergence = 25,
    progressbar = FALSE
  )
INFO [2018-04-15 10:40:00] iter 25 loglikelihood = -32949.882
INFO [2018-04-15 10:40:00] iter 50 loglikelihood = -32901.801
INFO [2018-04-15 10:40:00] iter 75 loglikelihood = -32922.208
INFO [2018-04-15 10:40:00] early stopping at 75 iteration