How to calculate perplexity for LDA with Gibbs sampling
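I am running an LDA topic model in R on a collection of 200+ documents (65k words in total). The documents have been preprocessed and are stored in the document-term matrix dtm. In theory I should expect to find 5 distinct topics in the corpus, but I would like to calculate the perplexity score and see how the model changes with the number of topics. Below is the code I am using. The problem is that it gives me an error when I try to calculate the perplexity score, and I am not sure how to fix it (I am new to R). The error is in the last line of code. Any help would be appreciated.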
library(topicmodels)

burnin <- 4000   # burn-in iterations to discard
iter <- 2000     # number of iterations after burn-in
thin <- 500      # keep every 500th iteration to reduce correlation between samples
seed <- list(2003, 10, 100, 10005, 765)   # one seed per starting point
nstart <- 5      # use 5 different starting points
best <- TRUE     # return the run with the highest posterior probability
# Number of topics (run the algorithm for different values of k and choose by inspecting the results)
k <- 5
# Run LDA using Gibbs sampling
ldaOut <- LDA(dtm, k, method = "Gibbs",
              control = list(nstart = nstart, seed = seed, best = best,
                             burnin = burnin, iter = iter, thin = thin))
perplexity(ldaOut, newdata = dtm)
Error in method(x, k, control, model, mycall, ...) : Need 1 seeds
One more argument is needed, "estimate_theta".
Use the code below:
perplexity(ldaOut, newdata = dtm, estimate_theta = FALSE)
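Since the question also asks how perplexity changes with the number of topics, here is a minimal sketch of that comparison. It is not part of the original answer and rests on several assumptions: the AssociatedPress data shipped with topicmodels stands in for your own dtm, the candidate range of k is hypothetical, a single seed is used (so nstart stays at its default of 1), and burnin/iter/thin are kept small only so the example finishes quickly. With estimate_theta = FALSE the score is computed on the same documents the model was fitted on, so it describes in-sample fit rather than held-out performance.

```r
library(topicmodels)

# Stand-in corpus so the sketch runs on its own; replace with your own dtm.
data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress[1:50, ]

# Hypothetical range of topic counts to compare.
candidate_k <- 2:6

# Fit one Gibbs-sampled model per k and score it on the fitting data.
perp <- sapply(candidate_k, function(k) {
  fit <- LDA(dtm, k, method = "Gibbs",
             control = list(seed = 2003, burnin = 100, iter = 200, thin = 50))
  # estimate_theta = FALSE reuses the fitted document-topic distributions,
  # so newdata must be the same matrix the model was fitted on.
  perplexity(fit, newdata = dtm, estimate_theta = FALSE)
})

# One row per candidate k; lower perplexity indicates a better in-sample fit.
data.frame(k = candidate_k, perplexity = perp)
```

For a real run you would likely restore the longer burnin, iter and thin values from the question and, if possible, score each model on documents held out from fitting.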