LDA in topicmodels producing a list of numbers rather than terms
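Please bear with me, as I am very new to this and am working on a course project for a certificate program.
I have a .csv dataset obtained by retrieving bibliometric records from the Pubmed and Embase databases. It has 1034 rows and several columns, but I am trying to build topic models from just one column, the Abstract column, and some records have no abstract. I have done some preprocessing (removing stopwords, punctuation, etc.) and have been able to plot the words that occur more than 200 times, create a ranked list of frequent terms, and run word associations with chosen words. So R does seem to be seeing the words themselves in the Abstract field. My problem comes when I try to create topic models with the topicmodels package. Here is the code I am using.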
#including 1st 3 lines for reference
options(header = FALSE, stringsAsFactors = FALSE, FileEncoding =
"latin1")
records <- read.csv("Combined.csv")
AbstractCorpus <- Corpus(VectorSource(records$Abstract))
AbstractTDM <- TermDocumentMatrix(AbstractCorpus)
library(topicmodels)
library(lda)
lda <- LDA(AbstractTDM, k = 8)
(term <- terms(lda, 6))
term <- (apply(term, MARGIN = 2, paste, collapse = ","))
However, the output I get for the topics is as follows.
     Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8
[1,] "499"   "733"   "390"   "833"   "17"    "413"   "719"   "392"
[2,] "484"   "655"   "808"   "412"   "550"   "881"   "721"   "61"
[3,] "857"   "299"   "878"   "909"   "15"    "258"   "47"    "164"
[4,] "491"   "672"   "313"   "1028"  "126"   "55"    "375"   "987"
[5,] "734"   "430"   "405"   "102"   "13"    "193"   "83"    "588"
[6,] "403"   "52"    "489"   "10"    "598"   "52"    "933"   "980"
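Why am I seeing numbers here rather than words?
In addition, the following code, which I took more or less directly from the R PDF on topicmodels, does produce values for me, but the topics are still numbers rather than words, which means nothing to me.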
#using information from topicmodels paper
library(tm)
library(topicmodels)
library(lda)
AbstractTM <- list(
  VEM = LDA(AbstractTDM, k = 10, control = list(seed = 505)),
  VEM_fixed = LDA(AbstractTDM, k = 10,
                  control = list(estimate.alpha = FALSE, seed = 505)),
  Gibbs = LDA(AbstractTDM, k = 10, method = "Gibbs",
              control = list(seed = 505, burnin = 100, thin = 10, iter = 100)),
  CTM = CTM(AbstractTDM, k = 10,
            control = list(seed = 505, var = list(tol = 10^-4),
                           em = list(tol = 10^-3)))
)
# To compare the fitted models, we first investigate the α values of the
# models fitted with VEM and α estimated and with VEM and α fixed
sapply(AbstractTM[1:2], slot, "alpha")
#Find entropy
sapply(AbstractTM, function(x)mean(apply(posterior(x)$topics, 1,
function(z) - sum(z * log(z)))))
#Find estimated topics and terms
Topic <- topics(AbstractTM[["VEM"]], 1)
Topic
#find 5 most frequent terms for each topic
Terms <- terms(AbstractTM[["VEM"]], 5)
Terms[,1:5]
Any ideas about what might be going wrong?
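Reading the topicmodels documentation, the LDA() function does require a DocumentTermMatrix, not a TermDocumentMatrix. A TermDocumentMatrix has terms as rows and documents as columns, so LDA() treats each term as a "document" and each document index as a "term". That would explain why the top "terms" you get back are numbers between 1 and 1034: they are the row numbers of your records. Try creating the former with DocumentTermMatrix(AbstractCorpus) and see if that works.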
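As a rough illustration of that suggestion, here is a minimal sketch. The toy abstracts below are made up and stand in for records$Abstract, and the values k = 2 and seed = 505 are arbitrary choices for the example. It also drops empty documents before fitting: since some of your records have no abstract, LDA() will most likely stop with an error about rows without any non-zero entries if they are left in.

```r
library(tm)
library(topicmodels)

# Hypothetical toy data standing in for records$Abstract;
# the empty string mimics a record with no abstract.
abstracts <- c(
  "gene expression cancer tumor cells",
  "cancer tumor therapy patients treatment",
  "heart blood pressure patients treatment",
  "",
  "gene expression protein cells sequence",
  "blood pressure heart disease risk"
)

corpus <- VCorpus(VectorSource(abstracts))

# Documents as rows, terms as columns: the layout LDA() expects
dtm <- DocumentTermMatrix(corpus)

# Drop documents that contain no terms at all
row_totals <- slam::row_sums(dtm)
dtm <- dtm[row_totals > 0, ]

fit <- LDA(dtm, k = 2, control = list(seed = 505))

# Top 3 terms per topic; these should now be words, not row numbers
top_terms <- terms(fit, 3)
print(top_terms)
```

With your own data you would swap in AbstractCorpus and your chosen k. As far as I can tell, library(lda) is not needed for any of this: LDA(), CTM(), terms() and topics() all come from topicmodels, and lda is a separate package.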