Using ggplot2 facet grid to explore a large dataset with continuous and categorical variables

Question

I have a dataset with >1000 observations belonging to group A or group B, and about 150 categorical and continuous variables. A small version is below.

set.seed(16)
mydf <- data.frame(ID = 1:50, group = sample(c("A", "B"), 50, replace = TRUE), length = rnorm(n = 50, mean = 0, sd = 1), weight = runif(50, min=0, max=1), color = sample(c("red", "orange", "yellow", "green", "blue"), 50,  replace = TRUE), size = sample(c("big", "small"), 50, replace = TRUE))

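I want to visually compare groups A and B for each variable. To start, I would like to make pairs of box plots showing A and B side by side for each continuous variable, and bar charts for each categorical variable. I think ggplot facet_grid would be ideal for this, but I am not sure how to choose the plot type based on the data type, or how to do this without specifying each variable one by one.

I am interested in ggplot2 help and in any alternative exploration techniques.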

Answer 1

What if you made the plots separately and then stitched them together into a grid?

set.seed(16)
mydf <- data.frame(ID = 1:50, group = sample(c("A", "B"), 50, replace = TRUE), length = rnorm(n = 50, mean = 0, sd = 1), weight = runif(50, min=0, max=1), color = sample(c("red", "orange", "yellow", "green", "blue"), 50,  replace = TRUE), size = sample(c("big", "small"), 50, replace = TRUE))


mydf


library(tidyverse)
library(cowplot)
library(reshape)

plot_continuous <- mydf %>%
    melt(id = "group", measure.vars = c("length", "weight")) %>%
    ggplot(aes(x = group, y = value)) +
    geom_boxplot() +
    facet_wrap(~variable)

plot_color <- mydf %>%
    count(group, color) %>%
    ggplot(aes(x = group, y = n)) +
    geom_col(aes(fill = color), position = "dodge") +
    ggtitle("Color")

plot_size <- mydf %>%
    count(group, size) %>%
    ggplot(aes(x = group, y = n)) +
    geom_col(aes(fill = size), position = "dodge") +
    ggtitle("Size")



plot_grid(plot_continuous, plot_color, plot_size, ncol = 2)
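
With about 150 variables, writing one plot per column by hand does not scale. The following is a minimal sketch of one way to extend the idea above, not part of the original answer: a hypothetical helper, `plot_by_type()`, picks a box plot for numeric columns and a dodged bar chart for everything else, and `plot_grid(plotlist = ...)` assembles the results. It assumes the grouping column is called `group` and that `ID` should be skipped.

```r
library(ggplot2)
library(cowplot)

# Same toy data as in the question
set.seed(16)
mydf <- data.frame(
  ID = 1:50,
  group = sample(c("A", "B"), 50, replace = TRUE),
  length = rnorm(n = 50, mean = 0, sd = 1),
  weight = runif(50, min = 0, max = 1),
  color = sample(c("red", "orange", "yellow", "green", "blue"), 50, replace = TRUE),
  size = sample(c("big", "small"), 50, replace = TRUE)
)

# Hypothetical helper (an assumption for illustration):
# box plot for numeric columns, dodged bar chart of counts otherwise.
plot_by_type <- function(data, var, group = "group") {
  if (is.numeric(data[[var]])) {
    ggplot(data, aes(x = .data[[group]], y = .data[[var]])) +
      geom_boxplot() +
      labs(x = group, y = var, title = var)
  } else {
    ggplot(data, aes(x = .data[[group]], fill = .data[[var]])) +
      geom_bar(position = "dodge") +
      labs(x = group, fill = var, title = var)
  }
}

# Every column except the identifier and the grouping column
vars <- setdiff(names(mydf), c("ID", "group"))
plots <- lapply(vars, function(v) plot_by_type(mydf, v))
names(plots) <- vars

plot_grid(plotlist = plots, ncol = 2)
```

For the full dataset a single grid would probably be too crowded, so it may help to split `plots` into chunks and draw one grid per chunk.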

Answer 2

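Exploring your data is arguably the most interesting and intellectually challenging part of research, so I encourage you to read more on the topic.
Visualization is of course important. @Parfait suggested reshaping your data to long format, which makes plotting easier. Your mix of continuous and categorical data is a bit tricky. Beginners often try very hard to avoid reshaping their data, but there is no need to worry! On the contrary, you will find that most questions require data in a specific shape, and in most cases there is no "one fits all" shape.
So the real challenge is how to shape the data before plotting. There are obviously many ways to do this. The approach below should help reshape the continuous and categorical columns "automatically". See the comments in the code.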

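As a side note, when loading data into R, I try to avoid storing categorical data as factors and convert to factors only when needed. How to do this depends on how you load your data. If it comes from a csv, you can use read.csv('your.csv', stringsAsFactors = FALSE)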
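
A minimal sketch of that workflow in base R, with a made-up temporary file and made-up column values purely for illustration. Since R 4.0.0, `stringsAsFactors = FALSE` is already the default for `read.csv()`, so the explicit argument mainly matters on older versions.

```r
# Write a tiny illustrative CSV to a temporary file (hypothetical data)
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    group = c("A", "B", "A"),
    size = c("big", "small", "big"),
    length = c(0.1, -0.4, 1.2)
  ),
  tmp,
  row.names = FALSE
)

# Keep text columns as character on import
dat <- read.csv(tmp, stringsAsFactors = FALSE)
classes_on_load <- sapply(dat, class)

# Convert to factor only where it is actually needed
dat$group <- factor(dat$group)
```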

``` r
library(tidyverse)

# Gather the numeric columns (without ID, which is also numeric).
# [I'd recommend against numeric IDs!]
data_num <- 
  mydf %>% 
  select(-ID) %>% 
  pivot_longer(cols = which(sapply(., is.numeric)), names_to = 'key', values_to =  'value')

#No need to use facet here
ggplot(data_num) +
  geom_boxplot(aes(key, value, color = group))

# Selecting the categorical columns is a bit trickier in this example,
# because your group is also categorical.
# One way:
# first convert all categorical columns to character,
# then turn your "group" into a factor,
# then gather the remaining character columns.

# I use a simple count() and mutate() to create a summary data frame with the
# proportions, and geom_col(), which is equivalent to geom_bar(stat = "identity").
# There may be neater ways, but this is pretty straightforward.

data_cat <-
  mydf %>%
  select(-ID) %>%
  mutate_if(.predicate = is.factor, .funs = as.character) %>%
  mutate(group = factor(group)) %>%
  pivot_longer(cols = which(sapply(., is.character)), names_to = 'key', values_to = 'value') %>%
  count(group, key, value) %>%
  group_by(group, key) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup() # I always ungroup after my data manipulations to avoid unexpected effects

ggplot(data_cat) +
  geom_col(aes(group, percent, fill = key)) +
  facet_grid(~ value)

```

Created on 2020-01-07 by the reprex package (v0.3.0)

Credit for how to conditionally gather goes to @H1
