您是否需要标记化您的文本以可视化来自 LDA 主题模型的数据?
Do you need to tokenize your text to visualize data from a LDA topic model?
我目前在 2016-2019 年的新闻文章中使用 textmineR 包 运行 LDA 主题模型。
但是,我对 R 很陌生,我不知道如何显示模型的结果。
我想显示我的模型在我收集数据的时间段内发现的 8 个主题的流行程度。数据结构化在数据框中。我的数据在日常级别定义为 %y-%m-%d
我的LDA模型是这样制作的:
## get textmineR dtm
dtm <- CreateDtm(doc_vec = dat$fulltext, # character vector of documents
ngram_window = c(1, 2),
doc_names = dat$names,
stopword_vec = c(stopwords::stopwords("da"), custom_stopwords),
lower = T, # lowercase - this is the default value
remove_punctuation = T, # punctuation - this is the default
remove_numbers = T, # numbers - this is the default
verbose = T,
cpus = 4)
dtm <- dtm[, colSums(dtm) > 3]
dtm <- dtm[, str_length(colnames(dtm)) > 3]
############################################################
## RUN & EXAMINE TOPIC MODEL
############################################################
# Draw quasi-random sample from the pc
set.seed(34838)
model <- FitLdaModel(dtm = dtm,
k = 8,
iterations = 500,
burnin = 200,
alpha = 0.1,
beta = 0.05,
optimize_alpha = TRUE,
calc_likelihood = TRUE,
calc_coherence = TRUE,
calc_r2 = TRUE,
cpus = 4)
# model log-likelihood
plot(model$log_likelihood, type = "l")
# topic coherence
summary(model$coherence)
hist(model$coherence,
col= "blue",
main = "Histogram of probabilistic coherence")
# top terms by topic
model$top_terms1 <- GetTopTerms(phi = model$phi, M = 10)
t(model$top_terms1)
# topic prevalence
model$prevalence <- colSums(model$theta) / sum(model$theta) * 100
# prevalence should be proportional to alpha
plot(model$prevalence, model$alpha, xlab = "prevalence", ylab = "alpha")
谁能告诉我如何绘制模型随时间推移发现的最流行的主题?
我需要标记文本或类似的东西吗?
我希望这是有道理的。
最好的,
标记化发生在 CreateDtm
函数中。所以,这听起来不像是你的问题。
您可以通过对 theta
的列取平均值来获得主题在一组文档中的流行度,矩阵是结果模型的一部分。
我无法用你的数据给你一个确切的答案,但我可以用 textmineR
附带的 nih_sample
数据给你一个类似的例子
# load the NIH sample data
data(nih_sample)
# create a dtm and topic model
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
doc_names = nih_sample$APPLICATION_ID)
m <- FitLdaModel(dtm = dtm, k = 20, iterations = 100, burnin = 75)
# aggregate theta by the year of the PROJECT_END variable
end_year <- stringr::str_split(string = nih_sample$PROJECT_END, pattern = "/")
end_year <- sapply(end_year, function(x) x[length(x)])
end_year <- as.numeric(end_year)
topic_by_year <- by(data = m$theta, INDICES = end_year, FUN = function(x){
if (is.null(nrow(x))) {
# if only one row, gets converted to a vector
# just return that vector
return(x)
} else { # if multiple rows, then aggregate
return(colMeans(x))
}
})
topic_by_year <- as.data.frame(do.call(rbind, topic_by_year))
topic_by_year <- as.data.frame(do.call(rbind, topic_by_year))
# plot topic 10's prevalence by year
plot(topic_by_year$year, topic_by_year$t_10, type = "l")
我目前在 2016-2019 年的新闻文章中使用 textmineR 包 运行 LDA 主题模型。 但是,我对 R 很陌生,我不知道如何显示模型的结果。
我想显示我的模型在我收集数据的时间段内发现的 8 个主题的流行程度。数据结构化在数据框中。我的数据在日常级别定义为 %y-%m-%d
我的LDA模型是这样制作的:
## get textmineR dtm
dtm <- CreateDtm(doc_vec = dat$fulltext, # character vector of documents
ngram_window = c(1, 2),
doc_names = dat$names,
stopword_vec = c(stopwords::stopwords("da"), custom_stopwords),
lower = T, # lowercase - this is the default value
remove_punctuation = T, # punctuation - this is the default
remove_numbers = T, # numbers - this is the default
verbose = T,
cpus = 4)
dtm <- dtm[, colSums(dtm) > 3]
dtm <- dtm[, str_length(colnames(dtm)) > 3]
############################################################
## RUN & EXAMINE TOPIC MODEL
############################################################
# Draw quasi-random sample from the pc
set.seed(34838)
model <- FitLdaModel(dtm = dtm,
k = 8,
iterations = 500,
burnin = 200,
alpha = 0.1,
beta = 0.05,
optimize_alpha = TRUE,
calc_likelihood = TRUE,
calc_coherence = TRUE,
calc_r2 = TRUE,
cpus = 4)
# model log-likelihood
plot(model$log_likelihood, type = "l")
# topic coherence
summary(model$coherence)
hist(model$coherence,
col= "blue",
main = "Histogram of probabilistic coherence")
# top terms by topic
model$top_terms1 <- GetTopTerms(phi = model$phi, M = 10)
t(model$top_terms1)
# topic prevalence
model$prevalence <- colSums(model$theta) / sum(model$theta) * 100
# prevalence should be proportional to alpha
plot(model$prevalence, model$alpha, xlab = "prevalence", ylab = "alpha")
谁能告诉我如何绘制模型随时间推移发现的最流行的主题? 我需要标记文本或类似的东西吗?
我希望这是有道理的。 最好的,
标记化发生在 CreateDtm
函数中。所以,这听起来不像是你的问题。
您可以通过对 theta
的列取平均值来获得主题在一组文档中的流行度,矩阵是结果模型的一部分。
我无法用你的数据给你一个确切的答案,但我可以用 textmineR
nih_sample
数据给你一个类似的例子
# load the NIH sample data
data(nih_sample)
# create a dtm and topic model
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
doc_names = nih_sample$APPLICATION_ID)
m <- FitLdaModel(dtm = dtm, k = 20, iterations = 100, burnin = 75)
# aggregate theta by the year of the PROJECT_END variable
end_year <- stringr::str_split(string = nih_sample$PROJECT_END, pattern = "/")
end_year <- sapply(end_year, function(x) x[length(x)])
end_year <- as.numeric(end_year)
topic_by_year <- by(data = m$theta, INDICES = end_year, FUN = function(x){
if (is.null(nrow(x))) {
# if only one row, gets converted to a vector
# just return that vector
return(x)
} else { # if multiple rows, then aggregate
return(colMeans(x))
}
})
topic_by_year <- as.data.frame(do.call(rbind, topic_by_year))
topic_by_year <- as.data.frame(do.call(rbind, topic_by_year))
# plot topic 10's prevalence by year
plot(topic_by_year$year, topic_by_year$t_10, type = "l")