Specify the output per topic to a specific number of words

Question

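After running LDA topic modeling in R, some words end up with the same beta value, so they are listed together when the results are plotted. This causes overlap and sometimes makes the plot unreadable.

Is there a way to limit the number of words displayed per topic to a specific number? In my dummy dataset, some words have the same beta value. I would like to tell R to show only 3 words per topic, or any other number I specify.

The code I currently use to plot the results looks like this: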

top_terms %>% # take the top terms
      group_by(topic) %>%
      mutate(top_term = term[which.max(beta)]) %>% 
      mutate(term = reorder(term, beta)) %>% 
      head(3) %>% # I tried this but that only works for the first topic
      ggplot(aes(term, beta, fill = factor(topic))) + 
      geom_col(show.legend = FALSE) + 
      facet_wrap(~ top_term, scales = "free") + 
      labs(x = NULL, y = "Beta") + # no x label, change y label
      coord_flip() # turn bars sideways

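I tried to solve this with head(3), but that only works for the first topic. What I need is something similar that does not ignore all the other topics.

Best regards. Stay safe, stay healthy.

Note: top_terms is a tibble.

Sample data: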

topic   term        beta
(int)   (chr)       (dbl)
1       book        0.9876
1       page        0.9765
1       chapter     0.9654
1       author      0.9654
2       sports      0.8765
2       soccer      0.8654
2       champions   0.8543
2       victory     0.8543
3       music       0.9543
3       song        0.8678
3       artist      0.7231
3       concert     0.7231
4       movie       0.9846
4       cinema      0.9647
4       cast        0.8878
4       story       0.8878

Sample data as dput output:

top_terms <- structure(list(topic = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 
  3L, 3L, 3L, 4L, 4L, 4L, 4L), term = c("book", "page", "chapter", 
    "author", "sports", "soccer", "champions", "victory", "music", 
    "song", "artist", "concert", "movie", "cinema", "cast", "story"
  ), beta = c(0.9876, 0.9765, 0.9654, 0.9654, 0.8765, 0.8654, 0.8543, 
    0.8543, 0.9543, 0.8678, 0.7231, 0.7231, 0.9846, 0.9647, 0.8878, 
    0.8878)), row.names = c(NA, -16L), class = "data.frame")

Answer 1

You can do the following:

library(dplyr)
library(ggplot2)

# take the top terms
graph_data <- top_terms %>%
  group_by(topic) %>%
  mutate(top_term = term[which.max(beta)]) %>% 
  mutate(term = reorder(term, beta),
    # populate an index column that runs from 1 to the number of rows in each topic
    index = seq_len(n())) %>% 
  # keep only the first 3 rows of each topic
  filter(index <= 3)

graph_data
#> # A tibble: 12 x 5
#> # Groups:   topic [4]
#>    topic term       beta top_term index
#>    <int> <fct>     <dbl> <chr>    <int>
#>  1     1 book      0.988 book         1
#>  2     1 page      0.976 book         2
#>  3     1 chapter   0.965 book         3
#>  4     2 sports    0.876 sports       1
#>  5     2 soccer    0.865 sports       2
#>  6     2 champions 0.854 sports       3
#>  7     3 music     0.954 music        1
#>  8     3 song      0.868 music        2
#>  9     3 artist    0.723 music        3
#> 10     4 movie     0.985 movie        1
#> 11     4 cinema    0.965 movie        2
#> 12     4 cast      0.888 movie        3

graph_data %>%
  ggplot(aes(term, beta, fill = factor(topic))) + 
  geom_col(show.legend = FALSE) + 
  facet_wrap(~ top_term, scales = "free") + 
  labs(x = NULL, y = "Beta") + # no x label, change y label
  coord_flip() # turn bars sideways

Created on 2021-05-13 by the reprex package (v2.0.0)
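
The index approach above keeps the first three rows of each topic, so it relies on the rows already being sorted by beta within each topic, as they are in the sample data. For input that is not sorted, a minimal sketch, assuming dplyr 1.0.0 or later, is to use slice_max() with with_ties = FALSE, which returns exactly n rows per group even when beta values are tied. The names unsorted and top_n_terms, and the row reversal used to simulate unsorted input, are only for illustration.

```r
library(dplyr)

# Sample data from the question, rebuilt so the block runs on its own
top_terms <- data.frame(
  topic = rep(1:4, each = 4),
  term = c("book", "page", "chapter", "author",
           "sports", "soccer", "champions", "victory",
           "music", "song", "artist", "concert",
           "movie", "cinema", "cast", "story"),
  beta = c(0.9876, 0.9765, 0.9654, 0.9654,
           0.8765, 0.8654, 0.8543, 0.8543,
           0.9543, 0.8678, 0.7231, 0.7231,
           0.9846, 0.9647, 0.8878, 0.8878)
)

# Reverse the row order to mimic input that is not sorted by beta
unsorted <- top_terms[rev(seq_len(nrow(top_terms))), ]

top_n_terms <- unsorted %>%
  group_by(topic) %>%
  # with_ties = FALSE returns exactly n rows even if the n-th beta is tied
  slice_max(beta, n = 3, with_ties = FALSE) %>%
  ungroup()

# Number of rows kept per topic
count(top_n_terms, topic)
```

Which of two tied terms is kept should not be relied on. If that matters, adding an explicit tie-breaker (for example sorting by term as well) before slicing is probably the safer route.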

Answer 2

Instead of head(), slice_head() applied after a group_by() on the grouping field will do the job here:

top_terms %>% # take the top terms
  group_by(topic) %>%
  mutate(top_term = term[which.max(beta)]) %>% 
  mutate(term = reorder(term, beta)) %>% 
  group_by(top_term) %>%
  slice_head(n = 3) %>% # keep the first 3 rows of each group
  ggplot(aes(term, beta, fill = factor(topic))) + 
  geom_col(show.legend = FALSE) + 
  facet_wrap(~ top_term, scales = "free") + 
  labs(x = NULL, y = "Beta") + # no x label, change y label
  coord_flip()
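
Since the question asks for "any specified number", the filtering step can be wrapped in a small function. The following is only a sketch: keep_top_terms and its n_terms argument are hypothetical names, and it assumes the rows are already sorted by descending beta within each topic, as in the sample data. It also contrasts head(), which ignores the grouping, with slice_head(), which is applied per group.

```r
library(dplyr)

# Sample data from the question, rebuilt so the block runs on its own
top_terms <- data.frame(
  topic = rep(1:4, each = 4),
  term = c("book", "page", "chapter", "author",
           "sports", "soccer", "champions", "victory",
           "music", "song", "artist", "concert",
           "movie", "cinema", "cast", "story"),
  beta = c(0.9876, 0.9765, 0.9654, 0.9654,
           0.8765, 0.8654, 0.8543, 0.8543,
           0.9543, 0.8678, 0.7231, 0.7231,
           0.9846, 0.9647, 0.8878, 0.8878)
)

# Hypothetical helper: keep the first n_terms rows of every topic.
# Assumes rows are already sorted by descending beta within each topic.
keep_top_terms <- function(data, n_terms = 3) {
  data %>%
    group_by(topic) %>%
    mutate(top_term = term[which.max(beta)]) %>%
    slice_head(n = n_terms) %>%  # applied once per topic
    ungroup()
}

# head() ignores the grouping and returns the first rows of the whole table
with_head <- top_terms %>% group_by(topic) %>% head(3)
unique(with_head$topic)

# slice_head() works per group, so every topic is still present
with_slice <- keep_top_terms(top_terms, n_terms = 2)
count(with_slice, topic)
```

The result of keep_top_terms() can then be passed to the same reorder() and ggplot() steps used in the answer.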

Tags: r, ggplot2, lda, topic-modeling, tibble