Error while finding the number of topics for a Latent Dirichlet Allocation model using the ldatuning library

Question

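This is the resulting error. I can tell it happens because at least one document contains no terms, but I don't understand why, or how to fix it.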

library(text2vec)
library(stringr)
library(magrittr)   # provides %>%

prep_fun <- function(x) {
  x %>%
    str_to_lower                         %>%   # make text lower case
    str_replace_all("[^[:alpha:]]", " ") %>%   # remove non-alpha symbols (punctuation and digits)
    str_replace_all("\\s+", " ")         %>%   # collapse multiple spaces
    str_replace_all("\\W*\\b\\w\\b\\W*", " ")  # remove single letters
}
tok_fun <- function(x) {
  tokens <- word_tokenizer(x)
  textstem::lemmatize_words(tokens)
}
it_patentes <- itoken(data$Abstract,
                      preprocessor = prep_fun,
                      tokenizer = tok_fun,
                      ids = data$id,
                      progressbar = FALSE)
vocab <- create_vocabulary(it_patentes, ngram = c(ngram_min = 1L, ngram_max = 3L),
                           stopwords = tm::stopwords("english"))
pruned_vocab <- prune_vocabulary(vocab, term_count_min = max(vocab$term_count) * .01,
                                 doc_proportion_min = 0.001)
vectorizer <- vocab_vectorizer(pruned_vocab)
dtm <- create_dtm(it_patentes, vectorizer, type = "dgTMatrix", progressbar = FALSE)

> #Plot the metrics to get number of topics 
> t1 <- Sys.time()
> tunes <- FindTopicsNumber(
+   dtm = dtm,
+   topics = c(2:25),
+   metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010"),
+   method = "Gibbs",
+   control = list(seed = 17),
+   mc.cores = 4L,
+   verbose = TRUE
+ )
fit models...Error in checkForRemoteErrors(val) : 
  4 nodes produced errors; first error: Each row of the input matrix needs to contain at least one non-zero entry
> print(difftime(Sys.time(), t1, units = 'sec'))
Time difference of 9.155343 secs
> FindTopicsNumber_plot(tunes)
Error in base::subset(values, select = 2:ncol(values)) : 
  object 'tunes' not found

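I know ldatuning was built for the topicmodels package, but I assumed that would not make much difference if all I want is a starting number of topics to test. Is that right?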

Answer 1

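ldatuning expects the input document-term matrix in a different format (the one used by the topicmodels package). You need to convert your dtm (a sparse matrix from the Matrix package) into a format that ldatuning understands.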
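A minimal sketch of one way to do that conversion, not necessarily the only one. It uses a small made-up matrix in place of the real `dtm`, and it assumes the slam and tm packages are installed (the conversion step is skipped otherwise). The error message in the question also says every row needs at least one non-zero entry, so the sketch first drops documents that are left with no terms, which can happen once `prune_vocabulary()` removes the only terms a short document contained. The names `dtm_nonempty`, `stm`, and `dtm_tm` are illustrative.

```r
library(Matrix)

# Toy stand-in for the text2vec DTM (hypothetical data): document "d2"
# has no terms left, as can happen after vocabulary pruning.
dtm <- as(
  sparseMatrix(i = c(1, 1, 3), j = c(1, 2, 2), x = c(2, 1, 4),
               dims = c(3, 2),
               dimnames = list(c("d1", "d2", "d3"), c("alpha", "beta"))),
  "TsparseMatrix"
)

# Step 1: drop documents whose row is entirely zero.
nonempty <- Matrix::rowSums(dtm) > 0
dtm_nonempty <- dtm[nonempty, , drop = FALSE]

# Step 2: rebuild the matrix as a tm DocumentTermMatrix, the class that
# topicmodels works with. Assumes slam and tm are available.
if (requireNamespace("slam", quietly = TRUE) &&
    requireNamespace("tm", quietly = TRUE)) {
  # summary() on a sparse Matrix yields one (i, j, x) row per non-zero cell.
  trip <- summary(as(dtm_nonempty, "CsparseMatrix"))
  stm <- slam::simple_triplet_matrix(
    i = trip$i, j = trip$j, v = trip$x,
    nrow = nrow(dtm_nonempty), ncol = ncol(dtm_nonempty),
    dimnames = dimnames(dtm_nonempty)
  )
  dtm_tm <- tm::as.DocumentTermMatrix(stm, weighting = tm::weightTf)
}
```

With the real data, `dtm_tm` would then be passed as the `dtm` argument of `FindTopicsNumber()` in place of the text2vec matrix. If many documents end up empty, loosening the `term_count_min` threshold in `prune_vocabulary()` may be worth trying as well.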

text-mining

lda

text2vec