Adding sample size to a box plot at the min or max of the facet in ggplot

Question

There are plenty of explanations of how to label box plots with sample size, including this good one. They all seem to use max(x) or median(x) to position the sample size.

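I was wondering whether there is an easy way to position the labels at the top or bottom of the plot, especially when facets are used with scales = "free_y", where ggplot picks the axis maximum and minimum automatically for each facet.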

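The reason is that I am creating multiple facets where the distributions are narrow and the facets are small. The sample sizes would be easier to read if they sat at the top or bottom of the plot... but I want to keep "free_y" because some facets show meaningful differences that are obscured by facets with a larger span in the data.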

Using a slightly modified example from the linked post:

library(ggplot2)

# function for number of observations 
give.n <- function(x){
  return(c(y = median(x)*1.05, label = length(x))) 
  # experiment with the multiplier to find the perfect position
}

# function for mean labels
mean.n <- function(x){
  return(c(y = median(x)*0.97, label = round(mean(x),2))) 
  # experiment with the multiplier to find the perfect position
}

# plot
ggplot(mtcars, aes(factor(cyl), mpg, label=rownames(mtcars))) +
  geom_boxplot(fill = "grey80", colour = "#3366FF") +
  stat_summary(fun.data = give.n, geom = "text", fun.y = median) +
  stat_summary(fun.data = mean.n, geom = "text", fun.y = mean, colour = "red") +
  facet_grid(cyl~., scales="free_y")

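Given this setup, how might I find the min or max of the y-axis for each facet and place the sample size there, rather than at the median, min, or max of each box-and-whisker?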

Edit

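I am updating the question with the information from R.S.'s answer below. The question is still unanswered, but their suggestion shows where this information can be found.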

ggplot_build(gg)$layout$panel_ranges[[order(levels(factor(mtcars$cyl)))[1]]]$y.range[1]

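gives the minimum of the y range for the first factor level of mtcars$cyl. So, by my logic, we need to build the plot without the stat_summary statements, then use the give.n function to look up both the sample size and the minimum of the y range. After that, we can add the stat_summary statement to the plot... like so: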

# plot
gg = ggplot(mtcars, aes(factor(cyl), mpg, label=rownames(mtcars))) +
  geom_boxplot(fill = "grey80", colour = "#3366FF") +
  facet_grid(cyl~., scales="free_y")

# function for number of observations 
give.n <- function(x){
  return(c(y = ggplot_build(gg)$layout$panel_ranges[[order(levels(factor(mtcars$cyl)))[x]]]$y.range[1], label = length(x))) 
  # experiment with the multiplier to find the perfect position
}

gg +
  stat_summary(fun.data = give.n, geom = "text", fun.y = "median")

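But... the code above does not work, because I do not quite understand what the give.n function is iterating over. Replacing [[x]] with any one of 1:3 plots all the sample sizes at the minimum of that facet, so that is progress.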

Here is the plot using [[2]], so all sample sizes are plotted at 17.62, the minimum of the range for the second facet.
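A small sketch of what give.n actually receives may help here. stat_summary() calls fun.data once per group and passes it that group's y values (mpg in this plot), so x is a numeric vector of data, not a panel number. Calling the question's original helper by hand on one group, as an illustration only, mimics that input:

```r
# Same helper as in the question: returns a label position and the group size.
give.n <- function(x) {
  c(y = median(x) * 1.05, label = length(x))
}

# The mpg values of one group, i.e. the kind of vector stat_summary() passes in.
mpg_cyl4 <- mtcars$mpg[mtcars$cyl == 4]
res <- give.n(mpg_cyl4)
print(res)

# x is a vector of mpg values, so it cannot serve as an index into panel_ranges.
str(mpg_cyl4)
```

Because the function never sees which facet it is working on, the panel minimum has to be joined to the data some other way, for example through a separate summary data frame.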

Answer 1

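You can inspect the structure of a ggplot object with ggplot_build; in particular, the x and y panel ranges are stored in the layout. Assign your plot to an object and look at the structure: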

gg <- ggplot(mtcars, aes(factor(cyl), mpg, label=rownames(mtcars))) +
  geom_boxplot(fill = "grey80", colour = "#3366FF") +
  stat_summary(fun.data = give.n, geom = "text", fun.y = median) +
  stat_summary(fun.data = mean.n, geom = "text", fun.y = mean, colour = "red") +
  facet_grid(cyl~., scales="free_y")

  ggplot_build(gg)

In particular, you will be interested in:

  ggplot_build(gg)$layout$panel_ranges

The ylim of each of the 3 panels is given as c(ymin, ymax) and stored in:

 ggplot_build(gg)$layout$panel_ranges[[1]]$y.range
 ggplot_build(gg)$layout$panel_ranges[[2]]$y.range
 ggplot_build(gg)$layout$panel_ranges[[3]]$y.range
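The panel_ranges element belongs to the ggplot2 release this answer was written against. As an assumption about newer installs (ggplot2 3.0.0 and later, as far as I know), the same information is stored under panel_params instead. A minimal sketch that tries both names, using a stripped-down version of the plot:

```r
library(ggplot2)

gg <- ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_boxplot() +
  facet_grid(cyl ~ ., scales = "free_y")

lay <- ggplot_build(gg)$layout

# Assumption: newer ggplot2 exposes panel_params; older releases used panel_ranges.
panels <- lay$panel_params
if (is.null(panels)) panels <- lay$panel_ranges

# Lower y limit of each panel, in facet order.
y_min <- vapply(panels, function(p) p$y.range[1], numeric(1))
y_min
```

Building the plot once and reusing the result also avoids calling ggplot_build repeatedly for every panel.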

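Edit in response to the comment, on how to incorporate this layout information into the plot. Here, instead of using stat_summary, we use dplyr to compute the summary statistics separately, grouped by cyl, and create a separate data frame to pass to ggplot2.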

 library(dplyr)
 gg.summary <- group_by(mtcars, cyl) %>% summarise(mean=mean(mpg), median=median(mpg), length=length(mpg))

Extract the ylim ranges and add them to the summary data frame, which is grouped by cyl, the variable we facet on:

 gg.summary$panel.ylim <- sapply(order(levels(factor(mtcars$cyl))), function(x) ggplot_build(gg)$layout$panel_ranges[[x]]$y.range[1])
 # # A tibble: 3 x 5
 # cyl     mean median length panel.ylim
 # <dbl>    <dbl>  <dbl>  <int>      <dbl>
 # 1     4 26.66364   26.0     11     20.775
 # 2     6 19.74286   19.7      7     17.620
 # 3     8 15.10000   15.2     14      9.960

Use it in ggplot; I believe this is the plot you wanted:

 gg + geom_text(data=gg.summary, (aes(x=factor(cyl), y=panel.ylim, label=paste("n =",length)))) +
   geom_text(data=gg.summary, (aes(x=factor(cyl), y=median*0.97, label=format(median, nsmall=2))))
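As a different technique from the ggplot_build lookup above, and only as a sketch: ggplot2 treats infinite positions as "the edge of the panel", so y = -Inf pins a label to the bottom of whichever facet it lands in, without reading the panel ranges at all. The names n_by_cyl and p are made up for this illustration.

```r
library(ggplot2)

# Per-group sample sizes, computed with base R.
n_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = length)
names(n_by_cyl)[2] <- "n"

p <- ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_boxplot(fill = "grey80", colour = "#3366FF") +
  # y = -Inf places the label on the lower edge of its own panel;
  # a negative vjust nudges the text up so it is not clipped.
  geom_text(data = n_by_cyl,
            aes(x = factor(cyl), y = -Inf, label = paste("n =", n)),
            vjust = -0.5) +
  facet_grid(cyl ~ ., scales = "free_y")

p
```

Using y = Inf with a positive vjust (for example 1.5) should put the labels along the top edge instead. Infinite values are not used when the scales are trained, so the free y limits stay as they were.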
