Topic label of each document in LDA model using textmineR
I am using textmineR to fit an LDA model to documents, similar to https://cran.r-project.org/web/packages/textmineR/vignettes/c_topic_modeling.html. Is it possible to get a topic label for each document in the data set?
library(textmineR)
data(nih_sample)
# create a document term matrix
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID,
                 ngram_window = c(1, 2),
                 stopword_vec = c(stopwords::stopwords("en"),
                                  stopwords::stopwords(source = "smart")),
                 lower = TRUE,
                 remove_punctuation = TRUE,
                 remove_numbers = TRUE,
                 verbose = FALSE,
                 cpus = 2)
# keep only terms that appear more than twice
dtm <- dtm[, colSums(dtm) > 2]
set.seed(123)
model <- FitLdaModel(dtm = dtm, k = 20, iterations = 200, burnin = 180,
                     alpha = 0.1, beta = 0.05, optimize_alpha = TRUE,
                     calc_likelihood = TRUE, calc_coherence = TRUE,
                     calc_r2 = TRUE, cpus = 2)
Then I add labels to the model:
model$labels <- LabelTopics(assignments = model$theta > 0.05, dtm = dtm, M = 1)
Now I want the topic label for each of the 100 documents in nih_sample$ABSTRACT_TEXT.
Are you looking to label each document with the label of its most prevalent topic? If so, this is how you could do it:
# convert labels to a data frame so we can merge
label_df <- data.frame(topic = rownames(model$labels), label = model$labels, stringsAsFactors = FALSE)
# get the top topic for each document
top_topics <- apply(model$theta, 1, function(x) names(x)[which.max(x)][1])
# convert the top topics for each document so we can merge
top_topics <- data.frame(document = names(top_topics), top_topic = top_topics, stringsAsFactors = FALSE)
# merge together. Now each document has a label from its top topic
top_topics <- merge(top_topics, label_df, by.x = "top_topic", by.y = "topic", all.x = TRUE)
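Note that merge() sorts its result on the key column by default, so the rows above come back ordered by topic rather than by document. If you want to keep the original document order, a direct lookup is one alternative. The following is only a minimal base R sketch on invented data: toy_theta and toy_labels are hypothetical stand-ins for model$theta and the first column of model$labels, and their values are made up for illustration.

```r
# Toy stand-in for model$theta: 3 documents (rows) x 3 topics (columns).
# All names and values here are invented for illustration.
toy_theta <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.3, 0.6,
    0.2, 0.5, 0.3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc_a", "doc_b", "doc_c"), c("t_1", "t_2", "t_3"))
)

# Toy stand-in for model$labels[, 1]: one label per topic, named by topic.
toy_labels <- c(t_1 = "cancer", t_2 = "brain", t_3 = "diabetes")

# Column index of the largest topic proportion in each row.
top_idx <- apply(toy_theta, 1, which.max)
top_topic <- colnames(toy_theta)[top_idx]

# Look each label up by topic name; rows stay in document order.
doc_labels <- data.frame(
  document  = rownames(toy_theta),
  top_topic = top_topic,
  label     = unname(toy_labels[top_topic]),
  stringsAsFactors = FALSE
)
print(doc_labels)
```

With the real model you would presumably substitute model$theta for toy_theta and a named vector built from model$labels for toy_labels.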
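That approach discards some of the information LDA gives you, though. One advantage of LDA is that each document can have more than one topic. Another is that you can see how much of each topic a document contains. For example, you can plot one document's topic distribution: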
# set the plot margins to see the labels on the bottom
par(mar = c(8.1,4.1,4.1,2.1))
# barplot the first document's topic distribution with labels
barplot(model$theta[1,], names.arg = model$labels, las = 2)
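If you want the multi-topic view as a table rather than a plot, one option is to keep every topic whose proportion in a document exceeds some cutoff. The sketch below assumes a cutoff of 0.25, which is an arbitrary choice for this example, and again uses invented data: toy_theta and toy_labels are hypothetical stand-ins for model$theta and the first column of model$labels.

```r
# Toy stand-in for model$theta: 3 documents (rows) x 3 topics (columns).
# All names and values here are invented for illustration.
toy_theta <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.3, 0.6,
    0.2, 0.5, 0.3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc_a", "doc_b", "doc_c"), c("t_1", "t_2", "t_3"))
)

# Toy stand-in for model$labels[, 1]: one label per topic, named by topic.
toy_labels <- c(t_1 = "cancer", t_2 = "brain", t_3 = "diabetes")

# Arbitrary cutoff chosen for this example; tune it for your own data.
threshold <- 0.25

# For each document, keep the topics above the cutoff,
# ordered from most to least prevalent.
per_doc <- lapply(rownames(toy_theta), function(d) {
  p <- toy_theta[d, ]
  keep <- sort(p[p > threshold], decreasing = TRUE)
  data.frame(
    document   = rep(d, length(keep)),  # rep() also handles zero matches
    topic      = names(keep),
    label      = unname(toy_labels[names(keep)]),
    proportion = unname(keep),
    stringsAsFactors = FALSE
  )
})

# Stack the per-document pieces into one long table.
multi_labels <- do.call(rbind, per_doc)
print(multi_labels)
```

The result has one row per document and topic pair, so a document with two prominent topics appears twice, each time with that topic's share.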