将带阴影的标准误差曲线添加到 ggplot2 中的 geom_density

Question

我想使用 ggplot2 将带阴影的标准误差曲线添加到 geom_density。我的代码如下所示：

data.plot <- data.frame(x = c(rnorm(100, mean = 0, sd = 5),
                                    rnorm(100, mean = 1, sd =2 )), 
                             g = factor(c(rep(1, 100),
                                        rep(2,100))))

ggplot(data.plot, aes(x, linetype = g)) + geom_density()

我找不到这样做的教程或示例。谢谢。

Answer 1

最佳解决方案是使用 bootstrapping，如评论中所述。我将使用经典的 iris 数据，重点关注每个 Species[=59] 的 Sepal.Length 的密度=].

生成样本

library(dplyr) library(broom) library(tidyr) library(ggplot2) data_frame(bs = 1:1000) %>% group_by(bs) %>% mutate(data = list(iris %>% group_by(Species) %>% sample_frac(size = 1, replace = T)))

# A tibble: 1,000 x 2 # Groups: bs [1,000] bs data <int> <list> 1 1 <tibble [150 x 5]> 2 2 <tibble [150 x 5]> 3 3 <tibble [150 x 5]> 4 4 <tibble [150 x 5]> 5 5 <tibble [150 x 5]> 6 6 <tibble [150 x 5]> 7 7 <tibble [150 x 5]> 8 8 <tibble [150 x 5]> 9 9 <tibble [150 x 5]> 10 10 <tibble [150 x 5]> # ... with 990 more rows

所以我只是对原始数据进行了 1000 bootstrap 次复制，从每组中取出与其原始样本大小相同的行数，并进行替换。现在我必须 unnest 才能访问嵌套的 data 列中的数据。

计算样本内密度

densities.within <- data_frame(bs = 1:1000) %>% group_by(bs) %>% mutate(data = list(iris %>% group_by(Species) %>% sample_frac(size = 1, replace = T))) %>% unnest() %>% group_by(bs, Species) %>% do(tidy(density(.$Sepal.Length, from = min(iris$Sepal.Length), to = max(iris$Sepal.Length), n = 128)))

# A tibble: 384,000 x 4 # Groups: bs, Species [30,000] bs Species x y <int> <fctr> <dbl> <dbl> 1 1 setosa 4.300000 0.2395786 2 1 setosa 4.328346 0.2821128 3 1 setosa 4.356693 0.3235939 4 1 setosa 4.385039 0.3632449 5 1 setosa 4.413386 0.4010378 6 1 setosa 4.441732 0.4375189 7 1 setosa 4.470079 0.4734727 8 1 setosa 4.498425 0.5095333 9 1 setosa 4.526772 0.5459280 10 1 setosa 4.555118 0.5824587 # ... with 383,990 more rows

所以我们将数据扩展为长格式，然后取物种内每个组的Sepal.Length的密度bs。我们必须提供手动 from = 和 to =，因为每个 bootstrap 中的最小值和最大值可能不同（为了节省时间，设置比默认值 512 更低的 n =） . 为了简化生成的 S3: density 对象，我们使用 broom::tidy。这是计算密集型步骤，因此我们将此对象保存为 densities.within.

将密度汇总为分位数

这会产生名为 x 和 y 的列，但我们将重命名它们以匹配我们的数据。然后我们会弄清楚：对于每个计算可能的密度Sepal.Length，CI的下限是多少，中位数是多少，上限是多少CI 结束？我们将使用 quantile 来获取计算密度的这些特定值。

densities.qtiles <- densities.within %>% rename(Sepal.Length = x, dens = y) %>% ungroup() %>% group_by(Species, Sepal.Length) %>% summarise(q05 = quantile(dens, 0.025), q50 = quantile(dens, 0.5), q95 = quantile(dens, 0.975))

# A tibble: 384 x 5 # Groups: Species [?] Species Sepal.Length q05 q50 q95 <fctr> <dbl> <dbl> <dbl> <dbl> 1 setosa 4.300000 0.05730022 0.2355335 0.4426299 2 setosa 4.328346 0.08177850 0.2734463 0.4970097 3 setosa 4.356693 0.09863062 0.3114570 0.5505578 4 setosa 4.385039 0.12459033 0.3430645 0.5884523 5 setosa 4.413386 0.15049699 0.3705389 0.6207344 6 setosa 4.441732 0.17494889 0.4006335 0.6418923 7 setosa 4.470079 0.19836510 0.4258006 0.6655006 8 setosa 4.498425 0.21106857 0.4555755 0.6971370 9 setosa 4.526772 0.23399070 0.4813130 0.7244413 10 setosa 4.555118 0.24863090 0.5108057 0.7708114 # ... with 374 more rows

可视化和比较

ggplot(densities.qtiles, aes(Sepal.Length, q50)) + facet_wrap(~Species, nrow = 2) + geom_histogram(data = iris, aes(Sepal.Length, ..density..), colour = "black", fill = "white", binwidth = 0.25, boundary = 0) + geom_ribbon(aes(ymin = q05, ymax = q95), alpha = 0.5, fill = "grey50") + stat_density(data = iris, aes(Sepal.Length, ..density.., color = "raw density"), size = 2, geom = "line") + geom_line(size = 1.5, aes(color = "bootstrapped")) + scale_color_manual(values = c("red", "black")) + labs(y = "density") + theme(legend.position = c(0.5,0), legend.justification = c(0,0))

我还包括了原始数据的直方图和密度层，以便进行比较。您可以看到 1000 bootstrap 个样本的中值密度和原始密度非常接近。

将带阴影的标准误差曲线添加到 ggplot2 中的 geom_density

Add shaded standard error curves to geom_density in ggplot2

r

standard-error

ggplot2

probability-density

生成样本

计算样本内密度

将密度汇总为分位数

可视化和比较