如何可视化概率分布函数之间的差异？

Question

我尝试可视化分布函数的两个直方图之间的差异，例如以下两条曲线的差异：

当差异很大时，你可以把两条曲线重叠起来，然后按照上面的方法填充差异，但当差异变得很小时，这很麻烦。绘制此图的另一种方法是绘制差异本身，如下所示：

然而，对于第一次看到这样的图表的每个人来说，这似乎很难阅读，所以我想知道：有没有其他方法可以直观地显示两个分布函数之间的差异？

Answer 1

我认为也许可以将您的两个命题简单地结合起来，同时扩大差异以使其可见。

接下来是尝试使用 ggplot2 执行此操作。实际上，做这件事比我最初想象的要复杂得多，而且我对结果绝对不是百分百满意；但也许它仍然有帮助。非常欢迎提出意见和改进。

library(ggplot2)
library(dplyr)

## function that replicates default ggplot2 colors
## taken from [1]
gg_color_hue <- function(n) {
  hues = seq(15, 375, length=n+1)
  hcl(h=hues, l=65, c=100)[1:n]
}

## Set up sample data
set.seed(1)
n <- 2000
x1 <- rlnorm(n, 0, 1)
x2 <- rlnorm(n, 0, 1.1)
df <- bind_rows(data.frame(sample=1, x=x1), data.frame(sample=2, x=x2)) %>%
  mutate(sample = as.factor(sample))

## Calculate density estimates
g1 <- ggplot(df, aes(x=x, group=sample, colour=sample)) +
  geom_density(data = df) + xlim(0, 10)
gg1 <- ggplot_build(g1)

## Use these estimates (available at the same x coordinates!) for
## calculating the differences.
## Inspired by [2]
x <- gg1$data[[1]]$x[gg1$data[[1]]$group == 1]
y1 <- gg1$data[[1]]$y[gg1$data[[1]]$group == 1]
y2 <- gg1$data[[1]]$y[gg1$data[[1]]$group == 2]
df2 <- data.frame(x = x, ymin = pmin(y1, y2), ymax = pmax(y1, y2), 
                  side=(y1<y2), ydiff = y2-y1)
g2 <- ggplot(df2) +
   geom_ribbon(aes(x = x, ymin = ymin, ymax = ymax, fill = side, alpha = 0.5)) +
   geom_line(aes(x = x, y = 5 * abs(ydiff), colour = side)) +
   geom_area(aes(x = x, y = 5 * abs(ydiff), fill = side, alpha = 0.4))
g3 <- g2 + 
   geom_density(data = df, size = 1, aes(x = x, group = sample, colour = sample)) +
   xlim(0, 10) +
   guides(alpha = FALSE, colour = FALSE) +
   ylab("Curves: density\n Shaded area: 5 * difference of densities") +
   scale_fill_manual(name = "samples", labels = 1:2, values = gg_color_hue(2)) +
   scale_colour_manual(limits = list(1, 2, FALSE, TRUE), values = rep(gg_color_hue(2), 2))

print(g3)

来源：SO answer 1, SO answer 2

正如@Gregor 在评论中所建议的那样，这是一个在彼此下方绘制两个单独的图但共享相同的 x 轴缩放比例的版本。至少应该明显调整图例。

library(ggplot2)
library(dplyr)
library(grid)

## function that replicates default ggplot2 colors
## taken from [1]
gg_color_hue <- function(n) {
  hues = seq(15, 375, length=n+1)
  hcl(h=hues, l=65, c=100)[1:n]
}

## Set up sample data
set.seed(1)
n <- 2000
x1 <- rlnorm(n, 0, 1)
x2 <- rlnorm(n, 0, 1.1)
df <- bind_rows(data.frame(sample=1, x=x1), data.frame(sample=2, x=x2)) %>%
  mutate(sample = as.factor(sample))

## Calculate density estimates
g1 <- ggplot(df, aes(x=x, group=sample, colour=sample)) +
  geom_density(data = df) + xlim(0, 10)
gg1 <- ggplot_build(g1)

## Use these estimates (available at the same x coordinates!) for
## calculating the differences.
## Inspired by [2]
x <- gg1$data[[1]]$x[gg1$data[[1]]$group == 1]
y1 <- gg1$data[[1]]$y[gg1$data[[1]]$group == 1]
y2 <- gg1$data[[1]]$y[gg1$data[[1]]$group == 2]
df2 <- data.frame(x = x, ymin = pmin(y1, y2), ymax = pmax(y1, y2), 
                  side=(y1<y2), ydiff = y2-y1)
g2 <- ggplot(df2) +
   geom_ribbon(aes(x = x, ymin = ymin, ymax = ymax, fill = side, alpha = 0.5)) +
   geom_density(data = df, size = 1, aes(x = x, group = sample, colour = sample)) +
  xlim(0, 10) +
  guides(alpha = FALSE, fill = FALSE)
g3 <- ggplot(df2) +
   geom_line(aes(x = x, y = abs(ydiff), colour = side)) +
   geom_area(aes(x = x, y = abs(ydiff), fill = side, alpha = 0.4)) +
   guides(alpha = FALSE, fill = FALSE)
## See [3]
grid.draw(rbind(ggplotGrob(g2), ggplotGrob(g3), size="last"))

... 或在第二个情节的构造中将 abs(ydiff) 替换为 ydiff：

来源：SO answer 3

如何可视化概率分布函数之间的差异？

How to visualise the difference between probability distribution functions?

visualization

r

distribution

ggplot2