Passing `top_n` and `arrange` to ggplot (dplyr)

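There is a lovely piece of code in Section 3.3 of Tidy Text Mining that I am trying to replicate on my own dataset. With my data, however, I cannot get ggplot to "remember" that I want the data in descending order and that I want a certain top_n.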

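I can run the code from Tidy Text Mining and I get the same chart the book shows. However, when I run it on my own dataset, the facets do not show the top_n (they seem to show a random number of categories), and the data within each facet is not sorted in descending order.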

I can reproduce the problem with some random text data and the full code, but I can also reproduce it with mtcars, which really confuses me.

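I expect the chart below to show mpg in descending order within each facet, and to give me only the top 1 category per facet. It does not do that for me.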

require(tidyverse)

mtcars %>%
  arrange (desc(mpg)) %>%
  mutate (gear = factor(gear, levels = rev(unique(gear)))) %>%
  group_by(am) %>%
  top_n(1) %>%
  ungroup %>%
  ggplot (aes (gear, mpg, fill = am)) +
  geom_col (show.legend = FALSE) +
  labs (x = NULL, y = "mpg") +
  facet_wrap(~am, ncol = 2, scales = "free") + 
  coord_flip()

What I really want is a chart sorted the way it is in the Tidy Text book (the data below is for illustration only).

require(tidyverse)
require(tidytext)

starwars <- tibble (film = c("ANH", "ESB", "ROJ"),
                  text = c("It is a period of civil war. Rebel spaceships, striking from a hidden base, have won their first victory against the evil Galactic Empire. During the battle, Rebel spies managed to steal secret plans to the Empire's ultimate weapon, the DEATH STAR, an armored space station with enough power to destroy an entire planet. Pursued by the Empire's sinister agents, Princess Leia races home aboard her starship, custodian of the stolen plans that can save her people and restore freedom to the galaxy.....",
                           "It is a dark time for the Rebellion. Although the Death Star has been destroyed, Imperial troops have driven the Rebel forces from their hidden base and pursued them across the galaxy. Evading the dreaded Imperial Starfleet, a group of freedom fighters led by Luke Skywalker has established a new secret base on the remote ice world of Hoth. The evil lord Darth Vader, obsessed with finding young Skywalker, has dispatched thousands of remote probes into the far reaches of space....",
                           "Luke Skywalker has returned to his home planet of Tatooine in an attempt to rescue his friend Han Solo from the clutches of the vile gangster Jabba the Hutt. Little does Luke know that the GALACTIC EMPIRE has secretly begun construction on a new armored space station even more powerful than the first dreaded Death Star. When completed, this ultimate weapon will spell certain doom for the small band of rebels struggling to restore freedom to the galaxy...")) %>%
  unnest_tokens(word, text) %>%
  mutate(film = as.factor(film)) %>%
  count(film, word, sort = TRUE) %>%
  ungroup()

total_wars <- starwars %>%
  group_by(film) %>%
  summarize(total = sum(n))

starwars <- left_join(starwars, total_wars)

starwars <- starwars %>%
  bind_tf_idf(word, film, n)

starwars %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(film) %>%
  top_n(10) %>%
  ungroup %>%
  ggplot(aes(word, tf_idf, fill = film)) +
  geom_col(show.legend = FALSE) +
  labs (x = NULL, y = "tf-idf") +
  facet_wrap(~film, ncol = 2, scales = "free") +
  coord_flip()

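I believe what is tripping you up here is that top_n() defaults to the last variable in the table unless you tell it which variable to use for ordering. In the example in our book, the last variable in the data frame is tf_idf, so that is what is used for ordering. In the mtcars example, top_n() also orders by the last column in the data frame, which happens to be carb.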
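As a small illustration of that default, here is a sketch that assumes only dplyr and the built-in mtcars data. The object names (last_col, by_default, by_mpg) are made up for this example. Because top_n() keeps ties, ordering by carb can return several rows per group, which would explain facets that appear to hold a random number of categories.

```r
library(dplyr)

# The last column of mtcars; this is what top_n() falls back on
last_col <- names(mtcars)[ncol(mtcars)]

# No ordering variable given: rows are picked by the last column (carb),
# and every row tied for the maximum is kept
by_default <- mtcars %>%
  group_by(am) %>%
  top_n(1) %>%
  ungroup()

# Ordering variable given explicitly: rows are picked by mpg
by_mpg <- mtcars %>%
  group_by(am) %>%
  top_n(1, mpg) %>%
  ungroup()

last_col
nrow(by_default)
nrow(by_mpg)
```

When no variable is supplied, dplyr also prints a "Selecting by ..." message naming the column it chose, which is a quick way to spot this situation.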

You can always tell top_n() which variable to order by, by passing it as an argument. For example, check out this similar workflow using the diamonds dataset.

library(tidyverse)

diamonds %>%
  arrange(desc(price)) %>%
  group_by(clarity) %>%
  top_n(10, price) %>%
  ungroup %>%
  ggplot(aes(cut, price, fill = clarity)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~clarity, scales = "free") + 
  scale_x_discrete(drop=FALSE) +
  coord_flip()

Created on 2018-05-17 by the reprex package (v0.2.0).

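These example datasets are not perfect parallels, because they do not have one row per combination of features the way a tidy text data frame does. Still, I am fairly sure that top_n() is the issue here.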
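For completeness, here is a sketch of the mtcars pipeline from the question with the ordering variable named explicitly. It is one possible arrangement rather than the only fix. The names top_cars and p are made up for this example, and converting am to a factor (so that fill is discrete) is my own choice, not part of the original code.

```r
library(dplyr)
library(ggplot2)

top_cars <- mtcars %>%
  group_by(am) %>%
  top_n(1, mpg) %>%            # rank by mpg explicitly, not by the last column
  ungroup() %>%
  arrange(desc(mpg)) %>%
  mutate(
    gear = factor(gear, levels = rev(unique(gear))),
    am   = factor(am)          # assumption: treat am as a discrete fill
  )

p <- ggplot(top_cars, aes(gear, mpg, fill = am)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "mpg") +
  facet_wrap(~am, ncol = 2, scales = "free") +
  coord_flip()

p
```

With top_n(1, mpg), each am group keeps only its highest-mpg row, so each facet shows a single bar.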