Error while finding the number of topics for a Latent Dirichlet Allocation model using the ldatuning library
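This is the resulting error. I can tell it occurs because at least one document has no terms, but I don't understand why that happens or how to fix it.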
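library(text2vec)   # itoken, create_vocabulary, create_dtm, word_tokenizer
library(stringr)    # str_* helpers and the %>% pipe
library(ldatuning)  # FindTopicsNumber

prep_fun <- function(x) {
  x %>%
    str_to_lower %>%                               # make text lower case
    str_replace_all("[^[:alpha:]]", " ") %>%       # remove non-alpha symbols (drops punctuation and "#")
    str_replace_all("\\s+", " ") %>%               # collapse multiple spaces
    str_replace_all("\\W*\\b\\w\\b\\W*", " ")      # remove single letters
}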
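tok_fun <- function(x) {
  tokens <- word_tokenizer(x)  # list with one character vector per document
  # lemmatize_words() expects a character vector, so apply it per document
  lapply(tokens, textstem::lemmatize_words)
}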
it_patentes <- itoken(data$Abstract,
                      preprocessor = prep_fun,
                      tokenizer = tok_fun,
                      ids = data$id,
                      progressbar = FALSE)
vocab <- create_vocabulary(it_patentes,
                           ngram = c(ngram_min = 1L, ngram_max = 3L),
                           stopwords = tm::stopwords("english"))
pruned_vocab <- prune_vocabulary(vocab,
                                 term_count_min = max(vocab$term_count) * .01,
                                 doc_proportion_min = 0.001)
vectorizer <- vocab_vectorizer(pruned_vocab)
dtm <- create_dtm(it_patentes, vectorizer, type = "dgTMatrix", progressbar = FALSE)
> #Plot the metrics to get number of topics
> t1 <- Sys.time()
> tunes <- FindTopicsNumber(
+ dtm = dtm,
+ topics = c(2:25),
+ metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010"),
+ method = "Gibbs",
+ control = list(seed = 17),
+ mc.cores = 4L,
+ verbose = TRUE
+ )
fit models...Error in checkForRemoteErrors(val) :
4 nodes produced errors; first error: Each row of the input matrix needs to contain at least one non-zero entry
> print(difftime(Sys.time(), t1, units = 'sec'))
Time difference of 9.155343 secs
> FindTopicsNumber_plot(tunes)
Error in base::subset(values, select = 2:ncol(values)) :
object 'tunes' not found
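Although I know ldatuning was built for the topicmodels package, I assumed that using it just to get a starting number of topics to test would not make much of a difference. Does it?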
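ldatuning expects the input dtm matrix in a different format (the format used by the topicmodels package). You need to convert your dtm (a sparse matrix from the Matrix package) into a format that ldatuning can understand.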
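A minimal sketch of one way to do that conversion, assuming the slam and tm packages are installed. The helper name `to_topicmodels_dtm` and the toy matrix are hypothetical and only for illustration. Because the error message says every row needs at least one non-zero entry, the sketch also drops documents that are left with no terms, which can plausibly happen after the vocabulary is pruned.

```r
library(Matrix)

# Hypothetical helper: convert a Matrix-package sparse DTM into a
# tm::DocumentTermMatrix and drop documents that have no terms left.
to_topicmodels_dtm <- function(m) {
  m <- as(m, "CsparseMatrix")                      # merges any duplicate triplets
  m <- m[Matrix::rowSums(m) > 0, , drop = FALSE]   # remove all-zero rows
  s <- summary(m)                                  # 1-based (i, j, x) triplets
  stm <- slam::simple_triplet_matrix(
    i = s$i, j = s$j, v = s$x,
    nrow = nrow(m), ncol = ncol(m),
    dimnames = dimnames(m)
  )
  tm::as.DocumentTermMatrix(stm, weighting = tm::weightTf)
}

# Toy stand-in for the text2vec `dtm`: document "d2" has no terms.
toy <- as(
  sparseMatrix(i = c(1, 1, 3), j = c(1, 2, 2), x = c(2, 1, 4),
               dims = c(3, 3),
               dimnames = list(c("d1", "d2", "d3"), c("a", "b", "c"))),
  "TsparseMatrix"
)
n_empty <- sum(Matrix::rowSums(toy) == 0)  # empty documents before cleaning
toy_dtm <- to_topicmodels_dtm(toy)
dim(toy_dtm)                               # rows = documents kept, cols = terms
```

With the real data, the converted object would be passed in place of the original matrix, e.g. `FindTopicsNumber(dtm = to_topicmodels_dtm(dtm), ...)` with the other arguments unchanged. Checking `sum(Matrix::rowSums(dtm) == 0)` first shows how many documents were emptied by preprocessing and pruning.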