Elbow/knee in a curve in R
I have this data processing:
library(text2vec)

## Using perplexity on a hold-out set
t1 <- Sys.time()
perplex <- c()
for (i in 3:25) {
  set.seed(17)
  lda_model2 <- LDA$new(n_topics = i)
  doc_topic_distr2 <- lda_model2$fit_transform(x = dtm, progressbar = FALSE)

  # Build the document-term matrix for the hold-out sample
  set.seed(17)
  sample.dtm2 <- itoken(rawsample$Abstract,
                        preprocessor = prep_fun,
                        tokenizer = tok_fun,
                        ids = rawsample$id,
                        progressbar = FALSE) %>%
    create_dtm(vectorizer, vtype = "dgTMatrix", progressbar = FALSE)

  # Infer topic proportions for the hold-out documents
  set.seed(17)
  new_doc_topic_distr2 <- lda_model2$transform(sample.dtm2, n_iter = 1000,
                                               convergence_tol = 0.001,
                                               n_check_convergence = 25,
                                               progressbar = FALSE)

  # Store perplexity at index i, so positions 1 and 2 stay NA
  perplex[i] <- text2vec::perplexity(sample.dtm2,
                                     topic_word_distribution = lda_model2$topic_word_distribution,
                                     doc_topic_distribution = new_doc_topic_distr2)
}
print(difftime(Sys.time(), t1, units = 'sec'))
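I know there are many questions like this, but I have not been able to find an answer that fits my case. Above you can see the perplexity calculation for Latent Dirichlet Allocation models with 3 to 25 topics. I want the most adequate value among those, which means I want to find the elbow or knee of these values, which can be treated as a simple numeric vector. The result looks like this: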
1 NA
2 NA
3 222.6229
4 210.3442
5 200.1335
6 190.3143
7 180.4195
8 174.2634
9 166.2670
10 159.7535
11 153.7785
12 148.1623
13 144.1554
14 141.8250
15 138.8301
16 134.4956
17 131.0745
18 128.8941
19 125.8468
20 123.8477
21 120.5155
22 118.4426
23 116.4619
24 113.2401
25 114.1233
plot(perplex)
This is how the plot looks:
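I would say the elbow is at 13 or 16, but I'm not entirely sure, and I want an exact number as the result. I saw in this paper that f''(x) / (1+f'(x)^2)^1.5 is the knee formula, so I tried it like this, and it says 18: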
> d1 <- diff(perplex) # first derivative
> d2 <- diff(d1) / diff(perplex[-1]) # second derivative
> knee <- (d2)/((1+(d1)^2)^1.5)
Warning message:
In (d2)/((1 + (d1)^2)^1.5) :
longer object length is not a multiple of shorter object length
> which.min(knee)
[1] 18
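I can't fully figure this out. Would anyone be willing to share how I could get the exact ideal number of topics from the perplexity results?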
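For reference, the warning in the attempt above appears because `d1` has 24 elements and `d2` has 23, so R recycles the shorter vector and the two derivatives no longer refer to the same points. A minimal sketch of the same curvature formula with both derivatives evaluated at the same interior points might look like the following. The helper name `discrete_curvature`, the use of central differences, and taking the knee as the point of maximum curvature are all assumptions made for illustration, not something stated in the post or the paper.

```r
# Hypothetical helper (not from text2vec or the paper): discrete curvature
# f'' / (1 + f'^2)^1.5 at interior points, assuming unit spacing in x.
discrete_curvature <- function(values) {
  n <- length(values)
  stopifnot(n >= 3, !anyNA(values))
  i <- 2:(n - 1)
  d1 <- (values[i + 1] - values[i - 1]) / 2             # central first difference
  d2 <- values[i + 1] - 2 * values[i] + values[i - 1]   # central second difference
  d2 / (1 + d1^2)^1.5                                   # same length as i
}

perplex <- c(NA, NA, 222.6229, 210.3442, 200.1335, 190.3143, 180.4195,
             174.2634, 166.2670, 159.7535, 153.7785, 148.1623, 144.1554,
             141.8250, 138.8301, 134.4956, 131.0745, 128.8941, 125.8468,
             123.8477, 120.5155, 118.4426, 116.4619, 113.2401, 114.1233)

topics <- which(!is.na(perplex))            # 3..25, the fitted topic counts
kappa <- discrete_curvature(perplex[topics])
interior <- topics[-c(1, length(topics))]   # 4..24, where kappa is defined

# Assumption: the knee is taken as the point of maximum curvature
knee <- interior[which.max(kappa)]
print(knee)
```

Dropping the NA entries first and indexing back through `interior` keeps the result expressed as a topic count rather than a position in a shortened vector. Note that the curvature value depends on the relative scale of the two axes, so rescaling the perplexity values can move the knee.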
I found this in this paper: "The LDA model with the optimal coherence score, obtained with an elbow method (the point with maximum absolute second derivative) (...)", so this code did the job: d1 <- diff(perplex); k <- which.max(abs(diff(d1) / diff(perplex[-1])))
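The quoted rule (the point with maximum absolute second derivative) can also be sketched with a plain second difference, which for unit spacing between topic counts needs no division. The helper name `elbow_by_second_diff` is hypothetical, and the sketch assumes the topic counts are consecutive integers. It returns the topic count itself, whereas `which.max` on a differenced vector returns a position that still has to be mapped back.

```r
# Hypothetical helper (not part of text2vec): elbow as the point with the
# largest absolute second difference, reported as a topic count.
elbow_by_second_diff <- function(values, topics) {
  ok <- !is.na(values)
  values <- values[ok]
  topics <- topics[ok]
  # Assumption: topic counts are consecutive integers (unit spacing)
  stopifnot(length(values) >= 3, all(diff(topics) == 1))
  # Element j is f(k + 1) - 2 * f(k) + f(k - 1), centered on topics[j + 1]
  d2 <- diff(values, differences = 2)
  interior <- topics[2:(length(topics) - 1)]
  interior[which.max(abs(d2))]
}

perplex <- c(NA, NA, 222.6229, 210.3442, 200.1335, 190.3143, 180.4195,
             174.2634, 166.2670, 159.7535, 153.7785, 148.1623, 144.1554,
             141.8250, 138.8301, 134.4956, 131.0745, 128.8941, 125.8468,
             123.8477, 120.5155, 118.4426, 116.4619, 113.2401, 114.1233)

k <- elbow_by_second_diff(perplex, seq_along(perplex))
print(k)
```

On the vector from the question, the largest absolute second difference falls at 24, because perplexity rises again from 24 to 25. Finite differences are sensitive to that kind of noise, so it may be worth smoothing the curve or excluding the last point before relying on the result.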