Using ggplot2 facet_grid to explore a large dataset with continuous and categorical variables
I have a dataset with >1000 observations, each belonging to group A or group B, and about 150 categorical and continuous variables. A small version is below.
set.seed(16)
mydf <- data.frame(
  ID = 1:50,
  group = sample(c("A", "B"), 50, replace = TRUE),
  length = rnorm(n = 50, mean = 0, sd = 1),
  weight = runif(50, min = 0, max = 1),
  color = sample(c("red", "orange", "yellow", "green", "blue"), 50, replace = TRUE),
  size = sample(c("big", "small"), 50, replace = TRUE)
)
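I would like to visually compare groups A and B for each of the variables. To start, I would like to make pairs of boxplots showing A and B side by side for each continuous variable, and bar charts for each categorical variable. I think ggplot facet_grid would be ideal for this, but I am not sure how to choose the plot type based on the data type, or how to do this without specifying each variable individually.
I am interested in ggplot2 help and in any alternative exploration techniques.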
What if you made the plots separately and then pieced them together in a grid?
set.seed(16)
mydf <- data.frame(
  ID = 1:50,
  group = sample(c("A", "B"), 50, replace = TRUE),
  length = rnorm(n = 50, mean = 0, sd = 1),
  weight = runif(50, min = 0, max = 1),
  color = sample(c("red", "orange", "yellow", "green", "blue"), 50, replace = TRUE),
  size = sample(c("big", "small"), 50, replace = TRUE)
)
mydf

library(tidyverse)
library(cowplot)
library(reshape)  # provides melt()

# Boxplots of A vs. B for the continuous variables, one panel per variable
plot_continuous <- mydf %>%
  melt(id.vars = "group", measure.vars = c("length", "weight")) %>%
  ggplot(aes(x = group, y = value)) +
  geom_boxplot() +
  facet_wrap(~variable)

# Dodged bar chart of the counts per color within each group
plot_color <- mydf %>%
  count(group, color) %>%
  ggplot(aes(x = group, y = n)) +
  geom_col(aes(fill = color), position = "dodge") +
  ggtitle("Color")

# Dodged bar chart of the counts per size within each group
plot_size <- mydf %>%
  count(group, size) %>%
  ggplot(aes(x = group, y = n)) +
  geom_col(aes(fill = size), position = "dodge") +
  ggtitle("Size")

# Arrange the three plots in a two-column grid
plot_grid(plot_continuous, plot_color, plot_size, ncol = 2)
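The same idea can probably be extended so that no variable has to be named by hand. The following is only a sketch under assumptions: `plot_one` is a hypothetical helper (not part of ggplot2 or cowplot) that draws a boxplot for a numeric column and a dodged bar chart otherwise, and it assumes the grouping column is called `group` and that `ID` should be skipped.

```r
library(ggplot2)
library(cowplot)

# Example data from the question
set.seed(16)
mydf <- data.frame(
  ID = 1:50,
  group = sample(c("A", "B"), 50, replace = TRUE),
  length = rnorm(n = 50, mean = 0, sd = 1),
  weight = runif(50, min = 0, max = 1),
  color = sample(c("red", "orange", "yellow", "green", "blue"), 50, replace = TRUE),
  size = sample(c("big", "small"), 50, replace = TRUE)
)

# Hypothetical helper: pick the geom from the column type
plot_one <- function(data, var, group = "group") {
  p <- ggplot(data, aes(x = .data[[group]]))
  if (is.numeric(data[[var]])) {
    # continuous variable: boxplots of A and B side by side
    p <- p + geom_boxplot(aes(y = .data[[var]]))
  } else {
    # categorical variable: counts per level, dodged within each group
    p <- p + geom_bar(aes(fill = .data[[var]]), position = "dodge")
  }
  p + labs(title = var, x = group, y = NULL, fill = NULL)
}

# Every column except the identifier and the grouping column
vars <- setdiff(names(mydf), c("ID", "group"))
plots <- lapply(vars, plot_one, data = mydf)

plot_grid(plotlist = plots, ncol = 2)
```

With around 150 variables a single grid would be unreadable, so it may make sense to pass only a slice of `plots` to `plot_grid()` at a time, for example `plots[1:12]`.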
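Exploring our data is arguably the most interesting and intellectually challenging part of our research, so I encourage you to read more on this topic.
Visualisation is of course important. @Parfait has suggested shaping your data long, which makes plotting easier. Your mix of continuous and categorical data is a bit tricky. Beginners often try very hard to avoid reshaping their data, but there is no need to worry! On the contrary, you will find that most questions require a specific data shape, and in most cases you will not find a "one fits all" shape.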
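To make "long" concrete, here is a minimal sketch on a tiny made-up data frame (the values are invented for illustration only). It uses `tidyr::pivot_longer()` and stacks the two measurement columns into one `key` column and one `value` column.

```r
library(tidyr)

# Tiny made-up data frame in wide shape: one column per variable
wide <- data.frame(
  group  = c("A", "B"),
  length = c(1.2, 0.7),
  weight = c(0.3, 0.9)
)

# Long shape: one row per combination of group and variable
long <- pivot_longer(
  wide,
  cols = c(length, weight),
  names_to = "key",
  values_to = "value"
)
long
```

In the long shape, the variable name becomes data that ggplot2 can map to an axis, a colour, or a facet.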
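So the real challenge is how to shape your data before plotting. There are obviously many ways to do this. Below is one way that should help to "automatically" reshape the continuous and the categorical columns. Comments are in the code.
As a side note, when loading data into R I try to avoid storing categorical data as factors, and I only convert to factors when I need to. How to do this depends on how you load your data. If it comes from a csv, you can use read.csv('your.csv', stringsAsFactors = FALSE) (since R 4.0.0 this is the default).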
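A small sketch of that habit, using made-up CSV content passed through the `text` argument of `read.csv()` so that no file is needed:

```r
# Made-up CSV content, read from a string so the example is self-contained
csv_text <- "group,color,weight
A,red,0.2
B,blue,0.7
A,red,0.5"

dat <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(dat)  # group and color are character columns

# Convert to a factor only where one is needed, e.g. to fix the level order
dat$color <- factor(dat$color, levels = c("red", "blue"))
levels(dat$color)
```

Keeping the columns as character until that point makes it easier to select them by type later on.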
``` r
library(tidyverse)

# Gather the numeric columns (dropping ID, which is also numeric).
# [I'd recommend against numeric IDs!]
data_num <-
  mydf %>%
  select(-ID) %>%
  pivot_longer(cols = where(is.numeric), names_to = "key", values_to = "value")

# No need to use facets here
ggplot(data_num) +
  geom_boxplot(aes(key, value, color = group))

# Selecting the categorical columns is a bit more tricky in this example,
# because your group is also categorical.
# One way:
# first convert all categorical columns to character,
# then turn your "group" into a factor,
# then gather the remaining character columns.
# I use simple count() and mutate() to create a summary data frame with the
# proportions, and geom_col(), which equals geom_bar(stat = "identity").
# There may be neater ways, but this is pretty straightforward.
data_cat <-
  mydf %>%
  select(-ID) %>%
  mutate(across(where(is.factor), as.character)) %>%
  mutate(group = factor(group)) %>%
  pivot_longer(cols = where(is.character), names_to = "key", values_to = "value") %>%
  count(group, key, value) %>%
  group_by(group, key) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup() # I always ungroup after my data manipulations, to avoid unexpected effects

ggplot(data_cat) +
  geom_col(aes(group, percent, fill = key)) +
  facet_grid(~ value)
```
Created on 2020-01-07 by the reprex package (v0.3.0)
Credit for how to gather conditionally goes to @H1
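As a possible variation on the categorical plot (an alternative layout, not the code from the answer), the long data can also be faceted by variable instead of by level, with free x scales so that each panel shows only its own levels. A sketch, assuming the example data from the question:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Example data from the question
set.seed(16)
mydf <- data.frame(
  ID = 1:50,
  group = sample(c("A", "B"), 50, replace = TRUE),
  length = rnorm(n = 50, mean = 0, sd = 1),
  weight = runif(50, min = 0, max = 1),
  color = sample(c("red", "orange", "yellow", "green", "blue"), 50, replace = TRUE),
  size = sample(c("big", "small"), 50, replace = TRUE)
)

# Keep group plus all categorical columns, reshape to long, count per level
data_cat <- mydf %>%
  mutate(across(where(is.factor), as.character)) %>%
  select(group, where(is.character)) %>%
  pivot_longer(cols = -group, names_to = "key", values_to = "value") %>%
  count(group, key, value)

# One panel per categorical variable, A and B side by side for each level
ggplot(data_cat, aes(x = value, y = n, fill = group)) +
  geom_col(position = "dodge") +
  facet_wrap(~ key, scales = "free_x")
```

Here the fill colour distinguishes A from B, which may be easier to read when the variables have different numbers of levels.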