Plotting two overlapping density curves using ggplot

Question

I have a data frame in R with 104 columns that looks like this:

   id         vcr1       vcr2         vcr3  sim_vcr1  sim_vcr2  sim_vcr3  sim_vcr4  sim_vcr5  sim_vcr6  sim_vcr7
1 2913 -4.782992840  1.7631999  0.003768704  1.376937 -2.096857  6.903021  7.018855  6.135139  3.188382  6.905323
2 1260  0.003768704  3.1577108 -0.758378208  1.376937 -2.096857  6.903021  7.018855  6.135139  3.188382  6.905323
3 2912 -4.782992840  1.7631999  0.003768704  1.376937 -2.096857  6.903021  7.018855  6.135139  3.188382  6.905323
4 2914 -1.311132669  0.8220594  2.372950077 -4.194246 -1.460474 -9.101704 -6.663676 -5.364724 -2.717272 -3.682574
5 2915 -1.311132669  0.8220594  2.372950077 -4.194246 -1.460474 -9.101704 -6.663676 -5.364724 -2.717272 -3.682574
6 1261  2.372950077 -0.7022792 -4.951318264 -4.194246 -1.460474 -9.101704 -6.663676 -5.364724 -2.717272 -3.682574

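The "sim_vcr*" variables continue all the way to sim_vcr100.

I need a single plot containing two overlapping density curves, looking something like this (except that here you see 5 curves rather than 2):

One density curve should be built from all the values in columns vcr1, vcr2 and vcr3. The other should be built from all the values in all of the sim_vcr* columns (so 100 columns, sim_vcr1 to sim_vcr100).

Because the two curves overlap, they need to be transparent, as in the attached image. I know there is a fairly simple way to do this with ggplot, but I am having trouble with the syntax and with arranging the data frame so that each curve is drawn from the correct columns.

Any help is greatly appreciated.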

Answer 1

Assuming df is the data you mentioned in your post, you can try this.

Split the data frame with the following code, then plot each part:

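library(tidyverse)
# Column positions of the observed (vcr*) and simulated (sim_vcr*) variables
i1 <- which(startsWith(names(df), 'vcr'))
i2 <- which(startsWith(names(df), 'sim'))
# Keep the id column plus each group of columns
df1 <- df[, c(1, i1)]
df2 <- df[, c(1, i2)]
# Reshape to long format (columns: id, name, value)
M1 <- pivot_longer(df1, cols = names(df1)[-1])
M2 <- pivot_longer(df2, cols = names(df2)[-1])
# Plot 1: one curve per vcr column
ggplot(M1) + geom_density(aes(x = value, fill = name), alpha = .5)
# Plot 2: one curve per sim_vcr column
ggplot(M2) + geom_density(aes(x = value, fill = name), alpha = .5)

Update

To get everything in a single plot, use the following code:

# Single plot
# Reshape every column except id to long format
M <- pivot_longer(df, cols = names(df)[-1])
# Label each value by the group its source column belongs to
M$var <- ifelse(startsWith(M$name, 'vcr'), 'vcr', 'sim_vcr')
# Plot 3: two overlapping, semi-transparent curves
ggplot(M) + geom_density(aes(x = value, fill = var), alpha = .5)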
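For anyone who wants to try the single-plot approach without the original data, here is a minimal sketch on made-up values. The data frame toy and its random contents are assumptions for illustration only; just the column layout (id, vcr1 to vcr3, sim_vcr1 to sim_vcr100) follows the question.

```r
library(tidyr)
library(ggplot2)

# Hypothetical toy data with the same layout as the question:
# 6 rows, an id column, 3 vcr columns and 100 sim_vcr columns
set.seed(1)
n <- 6
obs <- matrix(rnorm(n * 3), nrow = n,
              dimnames = list(NULL, paste0("vcr", 1:3)))
sim <- matrix(rnorm(n * 100, sd = 4), nrow = n,
              dimnames = list(NULL, paste0("sim_vcr", 1:100)))
toy <- data.frame(id = seq_len(n), obs, sim)

# Long format: one row per (id, column) pair, in columns name and value
M <- pivot_longer(toy, cols = -id)

# Label each value by the group its source column belongs to
M$var <- ifelse(startsWith(M$name, "vcr"), "vcr", "sim_vcr")

# Two semi-transparent density curves, one per group
p <- ggplot(M, aes(x = value, fill = var)) +
  geom_density(alpha = 0.5)
print(p)
```

Each density is normalised on its own, so the vcr curve (18 values here) and the sim_vcr curve (600 values) are drawn on a comparable scale even though the groups differ in size.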

Answer 2

Using the tidyverse packages, you can first convert the data to long format with pivot_longer (from tidyr), as follows:

library(tidyverse)
# Reshape to long format: column names go to 'type', cell values to 'values'
df <- df %>% pivot_longer(cols = c(starts_with('vcr'), starts_with('sim_vcr')),
                          names_to = 'type',
                          values_to = 'values')

Then, using filter, you can create a separate plot for each value type. For the vcr columns:

df %>% 
  filter(str_detect(type, '^vcr')) %>%
  ggplot(.) +
  geom_density(aes(x = values, fill = type), alpha = 0.5)

The above produces the following plot:

For the sim_vcr columns:

df %>%
  filter(str_detect(type, '^sim_vcr')) %>%
  ggplot(.) +
  geom_density(aes(x = values, fill = type), alpha = 0.5)

The above code produces the following plot:
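The two filtered plots above still draw one curve per column. If the goal is the two pooled curves from the question, one possible variation is to add a grouping column to the long data and map fill to it. This is only a sketch: the data frame wide, the column group and the random values are hypothetical, and only 5 sim_vcr columns are generated to keep it short.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for the wide data frame (only 5 sim_vcr columns)
set.seed(42)
sim <- as.data.frame(matrix(rnorm(6 * 5, sd = 4), nrow = 6,
                            dimnames = list(NULL, paste0("sim_vcr", 1:5))))
wide <- bind_cols(
  tibble(id = 1:6, vcr1 = rnorm(6), vcr2 = rnorm(6), vcr3 = rnorm(6)),
  sim
)

# Same reshaping as above, plus a two-level grouping column
long <- wide %>%
  pivot_longer(cols = c(starts_with("vcr"), starts_with("sim_vcr")),
               names_to = "type",
               values_to = "values") %>%
  mutate(group = if_else(startsWith(type, "sim_vcr"), "sim_vcr", "vcr"))

# One density curve per group instead of one per column
p <- ggplot(long, aes(x = values, fill = group)) +
  geom_density(alpha = 0.5)
print(p)
```

Mapping fill to group rather than type is what collapses the per-column curves into two.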

Answer 3

Another simple way to subset and prepare data for ggplot is tidyr's gather(), which you can read more about. Here is how I did it. df is the data frame you provided.

# Load tidyr for gather() and ggplot2 for plotting
library(tidyr)
library(ggplot2)

# Gather the three vcr columns (columns 2 to 4) on their own
df_vcr <- gather(data = df[, 2:4])

# Gather the remaining sim_vcr columns
df_sim <- gather(data = df[, -c(1:4)])

#Plot the first
ggplot() + 
  geom_density(data = df_vcr, 
               mapping = aes(value, group = key, color = key, fill = key),
               alpha = 0.5)
#Plot the second
ggplot() + 
  geom_density(data = df_sim,
               mapping = aes(value, group = key, color = key, fill = key),
               alpha = 0.5)

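That said, I am not quite sure what you mean by "all the values in all of the sim_vcr* columns". Perhaps you want all of those values in a single density curve? To do that, do not give ggplot any grouping information in the second case.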

ggplot() + geom_density(data = df_sim,
           mapping = aes(value),
           fill = "grey50",
           alpha = 0.5)

Note that I can still specify 'fill' outside the aes() function. It is then applied to every curve, rather than giving each group in 'key' a different color.
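Putting the pieces together, the two pooled curves can also be overlaid as two layers, each with its own data and a constant fill set outside aes(). A rough sketch follows, assuming a made-up data frame (toy here is hypothetical, with only 10 sim_vcr columns). Note that the tidyr documentation marks gather() as superseded by pivot_longer(), although it still works.

```r
library(tidyr)
library(ggplot2)

# Hypothetical toy data: id, vcr1 to vcr3, and 10 sim_vcr columns
set.seed(7)
obs <- matrix(rnorm(6 * 3), nrow = 6,
              dimnames = list(NULL, paste0("vcr", 1:3)))
sim <- matrix(rnorm(6 * 10, sd = 4), nrow = 6,
              dimnames = list(NULL, paste0("sim_vcr", 1:10)))
toy <- data.frame(id = 1:6, obs, sim)

# Gather each block of columns into key/value pairs
df_vcr <- gather(data = toy[, 2:4])
df_sim <- gather(data = toy[, -c(1:4)])

# Two layers, no grouping: each layer pools all of its values into one curve
p <- ggplot() +
  geom_density(data = df_sim, mapping = aes(value),
               fill = "grey50", alpha = 0.5) +
  geom_density(data = df_vcr, mapping = aes(value),
               fill = "steelblue", alpha = 0.5)
print(p)
```

Because the fills are constants rather than aesthetics, this version draws no legend. Mapping fill to a grouping column in a single long data frame would add one.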
