Optimize plotting of categorical variables using ggplot2 facet_grid: plot the proportion of only one of the two values for dichotomous variables
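I have a large data set with more than 150 categorical and continuous variables. Each observation (row) belongs to either group A or group B. For example: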
set.seed(16)
mydf <- data.frame(
  ID = 1:500,
  group = sample(c("A", "B", "B", "B"), 500, replace = TRUE),
  length = rnorm(n = 500, mean = 0, sd = 1),
  weight = runif(500, min = 0, max = 1),
  color = sample(c("red", "orange", "yellow", "green", "blue"), 500, replace = TRUE),
  size = sample(c("big", "small"), 500, replace = TRUE),
  age = sample(c("old", "young"), 500, replace = TRUE)
)
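I am trying to optimize a plot layout for visualizing the relationship between group and the proportional counts of the categorical variables. So far, with some help from a previous post, I have built the plot with ggplot2 facet_grid, but I have run into two issues.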
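Issue A: The bars are arranged alphabetically by value (e.g. big, old, small, young) rather than grouped by variable (age: young next to old; size: big next to small; etc.).

Issue B: For categorical variables with only two possible values, I want to plot the proportion in group A versus group B for only one of the values. For example, plot the group A versus group B proportion for "old" only, because the plot of the "young" proportions provides no new information. Other categorical variables, such as color with its multiple values, should get one bar for each possible value.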
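The reasoning behind issue B can be checked on a small table: within each group, the proportions of a two-valued variable sum to 1, so one value fully determines the other. The sketch below uses base R only and a made-up data frame named toy, which is an assumption for illustration and not part of the original data.

```r
# Hypothetical toy data, not the asker's data set
toy <- data.frame(
  group = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
  age   = c("old", "old", "old", "young", "old", "young", "young", "young", "young")
)

# Row-wise proportions: each row (group) sums to 1
props <- prop.table(table(toy$group, toy$age), margin = 1)
props

# For a two-valued variable, one column is 1 minus the other
complement_ok <- isTRUE(all.equal(unname(props[, "young"]), unname(1 - props[, "old"])))
complement_ok
```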
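I have already solved issue A by setting the factor levels explicitly with mutate(value = factor(value, levels = c("big", "small", "young", "old", "red", "orange", "yellow", "green", "blue"))). The bars now plot in the specified order, with the age values next to each other, the color values next to each other, and so on: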
library(dplyr)
library(tidyr)
library(ggplot2)

data_cat <-
  mydf %>%
  select(-ID) %>%
  mutate_if(.predicate = is.factor, .funs = as.character) %>%
  mutate(group = factor(group)) %>%
  # reshape every character column into key/value pairs
  pivot_longer(cols = which(sapply(., is.character)), names_to = "key", values_to = "value") %>%
  count(group, key, value) %>%
  group_by(group, key) %>%
  # proportion of each value within its group and variable
  mutate(percent = n / sum(n)) %>%
  # fix the facet order so values of the same variable sit together
  mutate(value = factor(value, levels = c("big", "small", "young", "old", "red", "orange", "yellow", "green", "blue")))

ggplot(data_cat) +
  geom_col(aes(group, percent, fill = key)) +
  facet_grid(~ value)
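With more than 150 variables, typing every level by hand may not scale. One possible way to get the same grouping is to sort the summary by variable and then by value, and take the level order from that sorted sequence. This is only a sketch: toy_cat and ordered_cat are hypothetical names, and toy_cat is a stand-in with the same key and value columns as data_cat.

```r
library(dplyr)

# Hypothetical stand-in for the key/value columns of data_cat
toy_cat <- data.frame(
  key   = c("size", "color", "age", "size", "color", "age"),
  value = c("small", "red", "young", "big", "blue", "old"),
  stringsAsFactors = FALSE
)

ordered_cat <- toy_cat %>%
  arrange(key, value) %>%                                   # sort by variable, then by value
  mutate(value = factor(value, levels = unique(value)))     # levels follow the sorted order

levels(ordered_cat$value)
# "old" "young" "blue" "red" "big" "small"
```

Because the levels follow the sorted rows, values of the same variable end up in adjacent facets without being listed manually. The order within each variable is alphabetical here, so a hand-written levels vector is still needed when a specific order such as "young" before "old" matters.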
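Issue B remains: suppressing the plot of one of the two outcomes of each dichotomous categorical variable. I think I need a way to extract the number of factor levels from each variable and then subset on the variables where that number is == 2. I have searched but have not yet found a way to do this.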
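Is this the solution to issue B you are looking for? I added a filter at the end of your data management steps that removes the values equal to "young" and "small" for the two dichotomous variables. The plot now shows bars for "big", "old", and the 5 color categories.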
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(16)
mydf <- data.frame(
  ID = 1:500,
  group = sample(c("A", "B", "B", "B"), 500, replace = TRUE),
  length = rnorm(n = 500, mean = 0, sd = 1),
  weight = runif(500, min = 0, max = 1),
  color = sample(c("red", "orange", "yellow", "green", "blue"), 500, replace = TRUE),
  size = sample(c("big", "small"), 500, replace = TRUE),
  age = sample(c("old", "young"), 500, replace = TRUE)
)
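# Flag the character/factor columns with exactly two distinct values.
# Counting unique values also works on R >= 4.0, where data.frame() no longer
# creates factors by default and levels(x) would be NULL for these columns.
key <- sapply(mydf, function(x) (is.character(x) || is.factor(x)) && length(unique(x)) == 2)
# group is also two-valued, so exclude it by name
dichotomous <- setdiff(names(which(key)), "group")

data_cat <- mydf %>%
  select(-ID) %>%
  mutate_if(.predicate = is.factor, .funs = as.character) %>%
  # set the second value (in order of appearance) of each dichotomous variable to NA
  mutate_at(.vars = dichotomous, .funs = function(x) ifelse(x == unique(x)[2], NA, x)) %>%
  mutate(group = factor(group)) %>%
  pivot_longer(cols = which(sapply(., is.character)), names_to = "key", values_to = "value") %>%
  count(group, key, value) %>%
  group_by(group, key) %>%
  # the NA rows still count toward the denominator, so the proportions stay correct
  mutate(percent = n / sum(n)) %>%
  mutate(value = factor(value, levels = c("big", "small", "young", "old", "red", "orange", "yellow", "green", "blue"))) %>%
  # drop the NA rows only after the proportions are computed
  na.omit()

ggplot(data_cat) +
  geom_col(aes(group, percent, fill = key)) +
  facet_grid(~ value)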
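A possible variation, not part of the answer above, is to compute the proportions first and then drop one value of every two-valued variable with a grouped filter. In this sketch the value that survives is the alphabetically first one, which is an arbitrary choice made here for determinism; toy_counts and toy_props are hypothetical names, and toy_counts mimics the output of count(group, key, value).

```r
library(dplyr)

# Hypothetical counts shaped like the output of count(group, key, value)
toy_counts <- data.frame(
  group = rep(c("A", "B"), each = 5),
  key   = rep(c("age", "age", "color", "color", "color"), times = 2),
  value = rep(c("old", "young", "blue", "green", "red"), times = 2),
  n     = c(3, 1, 2, 1, 1, 1, 4, 2, 2, 1),
  stringsAsFactors = FALSE
)

toy_props <- toy_counts %>%
  group_by(group, key) %>%
  mutate(percent = n / sum(n)) %>%   # proportions use every value of the variable
  group_by(key) %>%
  # keep all values of multi-valued variables, but only one value of two-valued ones
  filter(n_distinct(value) != 2 | value == min(value)) %>%
  ungroup()

toy_props
```

Filtering after the proportions are computed keeps the denominators intact, and no column names need to be collected beforehand.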