Name topics in LDA topic modeling based on beta values

Question

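I am currently trying to develop code for a paper I have to write. I want to do LDA-based topic modeling. I found some code repositories on GitHub, and I was able to combine them and adapt them slightly where necessary. Now I would like to add something that names each identified topic after the word with the highest beta value assigned to that topic. Any ideas? This is the first time I have written any code, so my expertise is very limited.

This is the part of the code where I would like to insert the "naming part":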

# get the top ten terms for each topic
top_terms <- topics %>%
  group_by(topic) %>% # treat each topic as a different group
  top_n(10, beta) %>% # get top 10 words
  ungroup() %>%
  arrange(topic, -beta) # arrange words in descending informativeness

# plot the top ten terms for each topic in order
top_terms %>%
  mutate(term = reorder(term, beta)) %>% # sort terms by beta value
  ggplot(aes(term, beta, fill = factor(topic))) + # plot beta by theme
  geom_col(show.legend = FALSE) + # as bar plot
  facet_wrap(~ topic, scales = "free") + # separate plot for each topic
  labs(x = NULL, y = "Beta") + # no x label, change y label
  coord_flip() # turn bars sideways

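I tried to insert it into this part of the code, but without success. I found this: R topic modeling: lda model labeling function, but it did not work for me, or I did not understand it.

I can't share more of the code because it contains sensitive data, but I would still greatly appreciate some expertise from the community.

Best regards, and stay safe.

Note: R reports that top_terms is a tibble. I made up some example data off the top of my head. The data in top_terms is structured exactly like this: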

topic  term       beta
(int)  (chr)      (dbl)
1      book       0.9876
1      page       0.9765
1      chapter    0.9654
2      sports     0.8765
2      soccer     0.8654
2      champions  0.8543
3      music      0.9543
3      song       0.8678
3      artist     0.7231
4      movie      0.9846
4      cinema     0.9647
4      cast       0.8878

Answer 1

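You can create an extra column in your data that, after grouping by topic, holds the name of the term with the highest beta in that topic. Faceting on that column instead of on the topic number then titles each panel with the topic's top term.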

suppressPackageStartupMessages({
  library(ggplot2)
  library(tibble)
  library(dplyr)
})

# Just replicating example data
top_terms <- tibble(
  topic = rep(1:4, each = 3),
  term = c("book", "page", "chapter", 
           "sports", "soccer", "champions", 
           "music", "song", "artist",
           "movie", "cinema", "cast"),
  beta = c(0.9876, 0.9765, 0.9654,
           0.8765, 0.8654, 0.8543,
           0.9543, 0.8678, 0.7231,
           0.9846, 0.9647, 0.8878)
) 

top_terms %>%
  group_by(topic) %>%
  mutate(top_term = term[which.max(beta)]) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ top_term, scales = "free") +
  labs(x = NULL, y = "Beta") +
  coord_flip()
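
One thing to watch with faceting on the top term alone: if two topics happened to share the same highest-beta term, they would get the same facet label and their bars would end up in one panel. A minimal sketch of one way around that, assuming dplyr 1.0.0 or later (for `slice_max()`), is to build a label that also carries the topic number. The names `topic_labels`, `topic_label`, and `labelled_terms` are made up for this illustration and do not come from the question.

```r
library(dplyr)

# Example data mirroring the structure described in the question
top_terms <- tibble::tibble(
  topic = rep(1:4, each = 3),
  term = c("book", "page", "chapter",
           "sports", "soccer", "champions",
           "music", "song", "artist",
           "movie", "cinema", "cast"),
  beta = c(0.9876, 0.9765, 0.9654,
           0.8765, 0.8654, 0.8543,
           0.9543, 0.8678, 0.7231,
           0.9846, 0.9647, 0.8878)
)

# One row per topic: keep the highest-beta term and turn it into a label
# (topic_labels and topic_label are hypothetical names)
topic_labels <- top_terms %>%
  group_by(topic) %>%
  slice_max(beta, n = 1, with_ties = FALSE) %>% # exactly one term per topic
  ungroup() %>%
  transmute(topic, topic_label = paste0("Topic ", topic, ": ", term))

# Attach the label to every term row so it can be used for faceting
labelled_terms <- top_terms %>%
  left_join(topic_labels, by = "topic")
```

With this in place, `labelled_terms` could replace `top_terms` in the plotting code, with `facet_wrap(~ topic_label, scales = "free")` in place of faceting on `top_term`.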

Created on 2021-05-05 by the reprex package (v1.0.0)
