Name topics in LDA topic modeling based on beta values

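I am currently developing code for a paper I have to write. I want to do LDA-based topic modeling. I found some code repositories on GitHub and was able to combine them and adapt them slightly where necessary. Now I would like to add something that names each identified topic after the word with the highest beta value assigned to that topic. Any ideas? This is the first time I have written any code, so my expertise is very limited.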


library(dplyr)
library(ggplot2)

# `topics` is assumed to hold one row per topic/term pair with its beta value
# get the top ten terms for each topic
top_terms <- topics %>%
  group_by(topic) %>%      # treat each topic as a different group
  top_n(10, beta) %>%      # get top 10 words
  ungroup() %>%
  arrange(topic, -beta)    # arrange words in descending informativeness

# plot the top ten terms for each topic in order
top_terms %>%
  mutate(term = reorder(term, beta)) %>%            # sort terms by beta value
  ggplot(aes(term, beta, fill = factor(topic))) +   # plot beta by topic
  geom_col(show.legend = FALSE) +                   # as bar plot
  facet_wrap(~ topic, scales = "free") +            # separate plot for each topic
  labs(x = NULL, y = "Beta") +                      # no x label, change y label
  coord_flip()                                      # turn bars sideways

I tried to insert this into this part of the code, but without success. I found this: "R topic modeling: lda model labeling function", but it does not work for me, or I do not understand it.



Note: it says top_terms is a tibble. I tried to make up some data off the top of my head. The data in top_terms is structured exactly like this:


topic  term       beta
(int)  (chr)      (dbl)
1      book       0.9876
1      page       0.9765
1      chapter    0.9654
2      sports     0.8765
2      soccer     0.8654
2      champions  0.8543
3      music      0.9543
3      song       0.8678
3      artist     0.7231
4      movie      0.9846
4      cinema     0.9647
4      cast       0.8878

You can create an additional column in the data that, after grouping by topic, takes the name of the term with the highest beta.


library(dplyr)
library(tibble)
library(ggplot2)

# Just replicating example data
top_terms <- tibble(
  topic = rep(1:4, each = 3),
  term = c("book", "page", "chapter", 
           "sports", "soccer", "champions", 
           "music", "song", "artist",
           "movie", "cinema", "cast"),
  beta = c(0.9876, 0.9765, 0.9654,
           0.8765, 0.8654, 0.8543,
           0.9543, 0.8678, 0.7231,
           0.9846, 0.9647, 0.8878)
)

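top_terms %>%
  group_by(topic) %>%
  mutate(top_term = term[which.max(beta)]) %>%  # highest-beta term within each topic
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ top_term, scales = "free") +     # facet titles show the topic name
  labs(x = NULL, y = "Beta") +
  coord_flip()                                  # turn bars sideways, as in the question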
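
If you want to reuse the names outside the plot, one option is to build a small lookup table with one row per topic and join it back onto the data. The following is only a sketch on the made-up example data from above; the names `topic_labels`, `top_three`, `label` and `labelled` are my own choices for illustration, not something from your script. Including the topic number in the label also keeps two topics apart if they happen to share the same top term, which faceting on the term alone would merge into one panel.

```r
library(dplyr)
library(tibble)

# Same made-up example data as above
top_terms <- tibble(
  topic = rep(1:4, each = 3),
  term = c("book", "page", "chapter",
           "sports", "soccer", "champions",
           "music", "song", "artist",
           "movie", "cinema", "cast"),
  beta = c(0.9876, 0.9765, 0.9654,
           0.8765, 0.8654, 0.8543,
           0.9543, 0.8678, 0.7231,
           0.9846, 0.9647, 0.8878)
)

# One row per topic: the highest-beta term and the three highest-beta terms
topic_labels <- top_terms %>%
  group_by(topic) %>%
  arrange(desc(beta), .by_group = TRUE) %>%   # sort by beta within each topic
  summarise(
    top_term  = first(term),
    top_three = paste(head(term, 3), collapse = ", "),
    .groups = "drop"
  ) %>%
  mutate(label = paste0("Topic ", topic, " (", top_three, ")"))

# Attach the labels to every row of the original data
labelled <- left_join(top_terms, topic_labels, by = "topic")
```

With `labelled` in place of `top_terms`, `facet_wrap(~ label, scales = "free")` should give panel titles such as "Topic 1 (book, page, chapter)".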

Created on 2021-05-05