Specify the output per topic to a specific number of words
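After running LDA topic modeling in R, some words have the same beta value. As a result, they are listed together when I plot the results. This causes overlap and sometimes makes the plot unreadable.
Is there a way to limit the number of words displayed per topic to a specific number?
In my dummy dataset, some words have the same beta value. I would like to tell R to display only 3 words per topic, or any other number I specify.
The code I currently use to plot the results looks like this: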
top_terms %>% # take the top terms
group_by(topic) %>%
mutate(top_term = term[which.max(beta)]) %>%
mutate(term = reorder(term, beta)) %>%
head(3) %>% # I tried this but that only works for the first topic
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ top_term, scales = "free") +
labs(x = NULL, y = "Beta") + # no x label, change y label
coord_flip() # turn bars sideways
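I tried to solve the problem with head(3), but that only works for the first topic.
What I need is something similar that does not ignore all the other topics.
Best regards.
Stay safe, stay healthy.
Note: top_terms is a tibble.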
Sample data:
topic term beta
(int) (chr) (dbl)
1 book 0,9876
1 page 0,9765
1 chapter 0,9654
1 author 0,9654
2 sports 0,8765
2 soccer 0,8654
2 champions 0,8543
2 victory 0,8543
3 music 0,9543
3 song 0,8678
3 artist 0,7231
3 concert 0,7231
4 movie 0,9846
4 cinema 0,9647
4 cast 0,8878
4 story 0,8878
Sample data as dput output:
top_terms <- structure(list(topic = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 4L, 4L, 4L, 4L), term = c("book", "page", "chapter",
"author", "sports", "soccer", "champions", "victory", "music",
"song", "artist", "concert", "movie", "cinema", "cast", "story"
), beta = c(0.9876, 0.9765, 0.9654, 0.9654, 0.8765, 0.8654, 0.8543,
0.8543, 0.9543, 0.8678, 0.7231, 0.7231, 0.9846, 0.9647, 0.8878,
0.8878)), row.names = c(NA, -16L), class = "data.frame")
You can do the following:
library(dplyr)
library(ggplot2)
# take the top terms
graph_data <- top_terms %>%
group_by(topic) %>%
mutate(top_term = term[which.max(beta)]) %>%
mutate(term = reorder(term, beta),
# populate an index column running from 1 to the number of rows in each topic
index = seq_len(n())) %>%
# filter by index <= 3
filter(index <= 3)
graph_data
#> # A tibble: 12 x 5
#> # Groups: topic [4]
#> topic term beta top_term index
#> <int> <fct> <dbl> <chr> <int>
#> 1 1 book 0.988 book 1
#> 2 1 page 0.976 book 2
#> 3 1 chapter 0.965 book 3
#> 4 2 sports 0.876 sports 1
#> 5 2 soccer 0.865 sports 2
#> 6 2 champions 0.854 sports 3
#> 7 3 music 0.954 music 1
#> 8 3 song 0.868 music 2
#> 9 3 artist 0.723 music 3
#> 10 4 movie 0.985 movie 1
#> 11 4 cinema 0.965 movie 2
#> 12 4 cast 0.888 movie 3
graph_data %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ top_term, scales = "free") +
labs(x = NULL, y = "Beta") + # no x label, change y label
coord_flip() # turn bars sideways
Created on 2021-05-13 by the reprex package (v2.0.0)
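The index filter above keeps the first three rows of each topic, so it only returns the highest-beta terms if the rows are already sorted by descending beta within each topic, as they are in the sample data. A minimal sketch for the case where that order is not guaranteed, and where the number of terms should be configurable, could look like this. The helper name `top_n_terms` and its argument `n_terms` are hypothetical and not part of the original answer.

```r
library(dplyr)

# Sample data from the question
top_terms <- data.frame(
  topic = rep(1:4, each = 4),
  term = c("book", "page", "chapter", "author",
           "sports", "soccer", "champions", "victory",
           "music", "song", "artist", "concert",
           "movie", "cinema", "cast", "story"),
  beta = c(0.9876, 0.9765, 0.9654, 0.9654,
           0.8765, 0.8654, 0.8543, 0.8543,
           0.9543, 0.8678, 0.7231, 0.7231,
           0.9846, 0.9647, 0.8878, 0.8878)
)

# Hypothetical helper: keep the n_terms highest-beta terms per topic
top_n_terms <- function(data, n_terms = 3) {
  data %>%
    group_by(topic) %>%
    arrange(desc(beta), .by_group = TRUE) %>%  # sort within each topic
    filter(row_number() <= n_terms) %>%        # keep the first n_terms rows per topic
    ungroup()
}

top_n_terms(top_terms, n_terms = 2)
```

The result can be passed to the same ggplot call as graph_data once the top_term and reordered term columns have been added.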
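slice_head, applied after a group_by on the grouping field, will do the job here instead of head. head() ignores the grouping and returns the first rows of the whole table, whereas slice_head() takes the first rows of each group: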
top_terms %>% # take the top terms
group_by(topic) %>%
mutate(top_term = term[which.max(beta)]) %>%
mutate(term = reorder(term, beta)) %>%
group_by(top_term) %>%
slice_head(n = 3) %>% # keep the first 3 rows of each group
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ top_term, scales = "free") +
labs(x = NULL, y = "Beta") + # no x label, change y label
coord_flip()
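Because the question is specifically about terms with equal beta values, it may help to see how ties affect the row count. The following sketch is not from the original answers; it uses dplyr's `slice_max()` and its `with_ties` argument instead of `slice_head()`, and assumes dplyr 1.0.0 or later. The variable names `ties_kept` and `ties_dropped` are illustrative.

```r
library(dplyr)

# Sample data from the question
top_terms <- data.frame(
  topic = rep(1:4, each = 4),
  term = c("book", "page", "chapter", "author",
           "sports", "soccer", "champions", "victory",
           "music", "song", "artist", "concert",
           "movie", "cinema", "cast", "story"),
  beta = c(0.9876, 0.9765, 0.9654, 0.9654,
           0.8765, 0.8654, 0.8543, 0.8543,
           0.9543, 0.8678, 0.7231, 0.7231,
           0.9846, 0.9647, 0.8878, 0.8878)
)

# Default behaviour: tied rows are all kept, so a topic can exceed 3 terms
ties_kept <- top_terms %>%
  group_by(topic) %>%
  slice_max(beta, n = 3) %>%
  ungroup()

# with_ties = FALSE: exactly 3 rows per topic, even when beta values tie
ties_dropped <- top_terms %>%
  group_by(topic) %>%
  slice_max(beta, n = 3, with_ties = FALSE) %>%
  ungroup()

nrow(ties_kept)     # 16: the 3rd and 4th terms tie in every topic
nrow(ties_dropped)  # 12: 3 terms per topic
```

Unlike slice_head, slice_max orders by beta itself, so it does not depend on the rows being pre-sorted. Which of two tied terms survives with `with_ties = FALSE` depends on row order, so this is only suitable when it does not matter which of the tied terms is shown.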