Elbow/knee in a curve in R

I have the following data processing code:

library(text2vec)

##Using perplexity for hold out set
t1 <- Sys.time()
perplex <- c()
for (i in 3:25){

  set.seed(17)
  lda_model2 <- LDA$new(n_topics = i)
  doc_topic_distr2 <- lda_model2$fit_transform(x = dtm, progressbar = FALSE)

  set.seed(17)
  sample.dtm2 <- itoken(rawsample$Abstract, 
                       preprocessor = prep_fun, 
                       tokenizer = tok_fun, 
                       ids = rawsample$id,
                       progressbar = FALSE) %>%
    create_dtm(vectorizer,vtype = "dgTMatrix", progressbar = FALSE)

  set.seed(17)
  new_doc_topic_distr2 <- lda_model2$transform(sample.dtm2, n_iter = 1000, 
                                               convergence_tol = 0.001, n_check_convergence = 25, 
                                               progressbar = FALSE)

  perplex[i]  <- text2vec::perplexity(sample.dtm2, topic_word_distribution = 
                                        lda_model2$topic_word_distribution, 
                                      doc_topic_distribution = new_doc_topic_distr2) 

}
print(difftime(Sys.time(), t1, units = 'sec'))

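I know there are many questions like this, but I have not been able to find an answer that fits my case. The code above computes the perplexity of Latent Dirichlet Allocation models with 3 to 25 topics. I want to pick the most suitable number of topics from these values, which means finding the elbow (or knee) of the curve. The values can be treated as a plain numeric vector, which looks like this: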

1   NA
2   NA
3   222.6229
4   210.3442
5   200.1335
6   190.3143
7   180.4195
8   174.2634
9   166.2670
10  159.7535
11  153.7785
12  148.1623
13  144.1554
14  141.8250
15  138.8301
16  134.4956
17  131.0745
18  128.8941
19  125.8468
20  123.8477
21  120.5155
22  118.4426
23  116.4619
24  113.2401
25  114.1233
plot(perplex)

This is what the plot looks like:

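I would say the elbow is at 13 or 16, but I am not completely sure, and I want an exact number as the result. I saw in this paper that f''(x) / (1+f'(x)^2)^1.5 is the knee formula, so I tried it as follows, and it says 18: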

> d1 <- diff(perplex)                # first derivative
> d2 <- diff(d1) / diff(perplex[-1]) # second derivative
> knee <- (d2)/((1+(d1)^2)^1.5)
Warning message:
In (d2)/((1 + (d1)^2)^1.5) :
  longer object length is not a multiple of shorter object length
> which.min(knee)
[1] 18
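
The warning above comes from dividing vectors of different lengths: d1 has one element fewer than perplex, and d2 has one fewer again, so R recycles the shorter vector. A minimal sketch of the same curvature formula with both derivatives estimated on the same interior points follows. It assumes unit spacing between topic counts, uses central differences (my choice, not something the paper prescribes), and takes the knee as the point of maximum curvature, which is an assumption suited to a convex, decreasing curve. The names `topics`, `inner`, `curvature` and `knee_topics` are illustrative only.

```r
# Perplexity values copied from the question, for 3 to 25 topics.
topics <- 3:25
perplex <- c(222.6229, 210.3442, 200.1335, 190.3143, 180.4195, 174.2634,
             166.2670, 159.7535, 153.7785, 148.1623, 144.1554, 141.8250,
             138.8301, 134.4956, 131.0745, 128.8941, 125.8468, 123.8477,
             120.5155, 118.4426, 116.4619, 113.2401, 114.1233)

n <- length(perplex)
inner <- 2:(n - 1)  # interior points, where both neighbours exist

# Central differences with unit spacing (assumption). Both vectors have
# length n - 2, so no recycling happens in the division below.
d1 <- (perplex[inner + 1] - perplex[inner - 1]) / 2
d2 <- perplex[inner + 1] - 2 * perplex[inner] + perplex[inner - 1]

# Curvature f''(x) / (1 + f'(x)^2)^1.5 at each interior point.
curvature <- d2 / (1 + d1^2)^1.5

# Map the position back to a topic count (assumption: knee = max curvature).
knee_topics <- topics[inner][which.max(curvature)]
print(knee_topics)
```

Two caveats apply to this sketch. Curvature depends on how the axes are scaled, so rescaling perplexity or the topic axis can move the reported knee. Finite differences also amplify noise, so a single bump in the curve can dominate the result.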

I cannot quite figure this out. Would anyone be willing to share how I can get the exact ideal number of topics from the perplexity results?

I found this in this paper: "The LDA model with the optimal coherence score, obtained with an elbow method (the point with maximum absolute second derivative) (...)", so this code did the job: d1 <- diff(perplex); k <- which.max(abs(diff(d1) / diff(perplex[-1])))
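
Note that diff(perplex[-1]) is the first difference with its first element dropped, so the one-liner divides the second difference by the first difference, and k is a position in the differenced vector rather than a topic count. A minimal sketch of the plain "maximum absolute second derivative" rule is given below, assuming unit spacing between topic counts and using the second difference as the derivative estimate. The names `topics`, `d2` and `elbow_topics` are illustrative only.

```r
# Perplexity values copied from the question, for 3 to 25 topics.
topics <- 3:25
perplex <- c(222.6229, 210.3442, 200.1335, 190.3143, 180.4195, 174.2634,
             166.2670, 159.7535, 153.7785, 148.1623, 144.1554, 141.8250,
             138.8301, 134.4956, 131.0745, 128.8941, 125.8468, 123.8477,
             120.5155, 118.4426, 116.4619, 113.2401, 114.1233)

# Second difference: d2[k] = perplex[k + 2] - 2 * perplex[k + 1] + perplex[k],
# which is centred on the (k + 1)-th value.
d2 <- diff(perplex, differences = 2)

# Shift the position by one to get the topic count the difference is centred on.
elbow_topics <- topics[which.max(abs(d2)) + 1]
print(elbow_topics)
```

Because the rule looks at a single largest jump, it is sensitive to noise in the perplexity values; smoothing the curve first is one option if the result looks unstable.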