Show percentage of presence of one variable given other variables

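I'm currently working on some plots, but I've run into a problem that I can't solve with my current knowledge of ggplot2.

I'll try to explain the problem using fictional data I created in R. Here is the output of str() on my fictional data frame: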

'data.frame':   15 obs. of  4 variables:
 $ x: Factor w/ 2 levels "0","1": 2 1 2 1 1 1 2 2 1 2 ...
 $ y: Factor w/ 2 levels "0","1": 1 2 2 1 2 2 2 1 1 1 ...
 $ w: Factor w/ 2 levels "0","1": 2 1 2 2 1 1 1 1 2 2 ...
 $ z: Factor w/ 2 levels "0","1": 2 1 2 2 1 1 2 2 1 2 ...

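As you can see, these are all dichotomous variables. Let's say my dependent variable is y. The plot I want to make is a bar chart like the one in the figure below:

I'd love to make a plot like this. A second option would be similar, but with an extra bar for the prevalence of y, comparing the group where each independent variable (x, w and z) is 1 with the group where it is 0. That second version would have 6 bars instead of 3. Either of the two would work well for what I need. Thanks in advance to the community, you are always very helpful.

Sample data: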

d <- structure(list(x = structure(c(2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L), .Label = c("0", "1"), class = "factor"),     y = structure(c(1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L,     2L, 2L, 2L, 1L), .Label = c("0", "1"), class = "factor"),     w = structure(c(2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L,     1L, 1L, 2L, 2L), .Label = c("0", "1"), class = "factor"),     z = structure(c(2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L,     2L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor")), row.names = c(NA, -15L), class = "data.frame")
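For reference, the six percentages the second plot would show can be checked in base R before any plotting. This is only a sketch: the data frame below is the sample data retyped compactly as 0/1 factors, and the name `prev` is made up for illustration.

```r
# Same values as the sample data, written compactly (0/1 kept as factors)
d <- data.frame(
  x = factor(c(1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1)),
  y = factor(c(0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0)),
  w = factor(c(1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1)),
  z = factor(c(1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0))
)

# For each predictor, the share of rows with y == "1" within each level.
# Rows of the result are the predictor levels ("0", "1"), columns are x, w, z.
prev <- sapply(d[c("x", "w", "z")], function(v) tapply(d$y == "1", v, mean))

round(100 * prev, 1)
```

The row for level "1" corresponds to the 3-bar version of the plot; both rows together give the 6-bar version.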

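Update: I actually misunderstood the question at first. The answer translated below is great, and my table approach (kept further down for posterity) is too clumsy.

Here is zx8754's answer translated to the tidyverse.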

library(tidyverse)
d <- structure(list(x = structure(c(2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L), .Label = c("0", "1"), class = "factor"),     y = structure(c(1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L,     2L, 2L, 2L, 1L), .Label = c("0", "1"), class = "factor"),     w = structure(c(2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L,     1L, 1L, 2L, 2L), .Label = c("0", "1"), class = "factor"),     z = structure(c(2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L,     2L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor")), row.names = c(NA, -15L), class = "data.frame")

d %>% 
  # one row per (predictor, level) observation, keeping y alongside
  pivot_longer(-y, names_to = "var", values_to = "val") %>%
  group_by(var, val) %>%
  # share of rows with y == "1" within each predictor level
  summarise(perc = mean(y == "1"), .groups = "drop") %>%
  ggplot(aes(var, perc)) +
  geom_col(aes(fill = as.factor(val)), position = "dodge") +
  scale_y_continuous(labels = scales::percent)

Created on 2021-04-07 by the reprex package (v1.0.0)
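For the first idea in the question (three bars, one per predictor), the same pipeline can be narrowed to the rows where the predictor equals 1. The following is a minimal sketch under that reading of the question; `three_bars` and `p` are names invented here, and the data frame is the sample data retyped compactly.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Sample data retyped compactly as 0/1 factors
d <- data.frame(
  x = factor(c(1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1)),
  y = factor(c(0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0)),
  w = factor(c(1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1)),
  z = factor(c(1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0))
)

three_bars <- d %>%
  pivot_longer(-y, names_to = "var", values_to = "val") %>%
  filter(val == "1") %>%                      # keep only rows where the predictor is 1
  group_by(var) %>%
  summarise(perc = mean(y == "1"), .groups = "drop")  # share of y == 1 in those rows

# One bar per predictor
p <- ggplot(three_bars, aes(var, perc)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent)
p
```

Dropping the `filter()` step and grouping by both `var` and `val` gives back the 6-bar version.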

I would tabulate first, then use the table to draw your bar chart.

library(tidyverse)
d <- structure(list(x = structure(c(2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L), .Label = c("0", "1"), class = "factor"),     y = structure(c(1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L,     2L, 2L, 2L, 1L), .Label = c("0", "1"), class = "factor"),     w = structure(c(2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L,     1L, 1L, 2L, 2L), .Label = c("0", "1"), class = "factor"),     z = structure(c(2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L,     2L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor")), row.names = c(NA, -15L), class = "data.frame")

y <- d$y
tab_df <- data.frame(apply(d[c("x", "w", "z")], 2, function(x) {
  tab <- table(x[x != 0], y[x != 0]) # rows where x is 1 only; first entry is the y = 0 count
  tab / sum(tab) # proportions of y = 0 and y = 1 among those rows
}))

tab_df %>% 
  mutate(y = 0:1) %>%
  pivot_longer(-y, names_to = "var", values_to = "percentage") %>%
  ggplot(aes(var, percentage)) +
  geom_col(aes(fill = as.factor(y))) +
  scale_y_continuous(labels = scales::percent)

Created on 2021-04-07 by the reprex package (v1.0.0)

Reshape from wide to long, then for each group compute the percentage of rows where y is 1:

library(data.table)
library(ggplot2)

# wide to long
setDT(d)
plotDat <- melt(d, id.vars = "y"
                )[ , .(yPC = sum(y == "1")/.N * 100),
                   by = .(variable, value)]

ggplot(plotDat, aes(variable, yPC, fill = value)) +
  geom_bar(stat = "identity", position = "dodge")
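If the percentages should also be printed above the bars, one possible extension is a geom_text layer dodged by the same width as the bars. This is a sketch rather than part of the answer above: the label format, the dodge width of 0.9 and the object name `p` are choices made here, and the data is the sample data retyped compactly.

```r
library(data.table)
library(ggplot2)

# Sample data retyped compactly as 0/1 factors
d <- data.table(
  x = factor(c(1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1)),
  y = factor(c(0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0)),
  w = factor(c(1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1)),
  z = factor(c(1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0))
)

# Wide to long, then percentage of y == "1" per (variable, value) group
plotDat <- melt(d, id.vars = "y"
                )[ , .(yPC = sum(y == "1") / .N * 100),
                   by = .(variable, value)]

dodge <- position_dodge(width = 0.9)  # shared so the labels line up with the bars

p <- ggplot(plotDat, aes(variable, yPC, fill = value)) +
  geom_col(position = dodge) +
  geom_text(aes(label = sprintf("%.1f%%", yPC), group = value),
            position = dodge, vjust = -0.5)
p
```

The explicit `group = value` in the text layer makes the dodging of the labels follow the same grouping as the fill.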

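I had a bit of fun doing the first question with stat_summary. It looks like there are already some good answers, though, so I'm just posting this out of interest.

(Edited to better position the percentage text vertically.)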

library(ggplot2)
library(dplyr)
library(tidyr)

d %>%
  pivot_longer(-y, names_to = "variable", values_to = "values") %>%
  # as.numeric() on a factor returns the level codes (1/2), so convert via character to get 0/1
  ggplot(aes(variable, y = as.numeric(as.character(values)))) +
  stat_summary(
    aes(label = scales::percent(after_stat(y))),
    geom = "text",
    fun = ~ mean(.x == 1),  # share of 1s within each variable
    vjust = -1
  ) +
  stat_summary(geom = "bar", fun = ~ mean(.x == 1)) +
  scale_y_continuous("Prevalence of 1s", labels = scales::percent, 
                     expand = expansion(add = c(NA, 0.05))) 

Created on 2021-04-07 by the reprex package (v2.0.0)
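The same stat_summary idea can also be pointed at y itself, which is closer to the 6-bar plot described in the question. The sketch below assumes that reading: y is converted to numeric 0/1 so that its mean within each group is the prevalence of y = 1. The axis title and the object name `p` are choices made here, and the data is the sample data retyped compactly.

```r
library(ggplot2)
library(dplyr)
library(tidyr)

# Sample data retyped compactly as 0/1 factors
d <- data.frame(
  x = factor(c(1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1)),
  y = factor(c(0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0)),
  w = factor(c(1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1)),
  z = factor(c(1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0))
)

p <- d %>%
  # go through character so the factor becomes 0/1 rather than level codes 1/2
  mutate(y = as.numeric(as.character(y))) %>%
  pivot_longer(-y, names_to = "variable", values_to = "value") %>%
  ggplot(aes(variable, y, fill = value)) +
  # mean of a 0/1 variable is the proportion of 1s in each (variable, value) group
  stat_summary(geom = "bar", fun = mean, position = "dodge") +
  scale_y_continuous("Prevalence of y = 1", labels = scales::percent)
p
```

Because the summary is computed inside the plot, no separate aggregation step is needed.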