How do I restrict the standard deviation bars in my barplot to a maximum value?

I'm using ggplot2 to create a bar plot with standard deviation bars. My data frame is large, but here is a truncated version as an example:

SampleName  Target.ID   Maj.Allele.Freq SD  AVG.MAF
W15-P2-1    rs1005533   99.74811083 24.98883743 93.70753223
W15-P2-2    rs1005533   100 24.98883743 93.70753223
W15-P2-3    rs1005533   100 24.98883743 93.70753223
W15-P2-4    rs1005533   100 24.98883743 93.70753223
W15-P2-1    rs1005533   99.94819995 24.98883743 93.70753223
W15-P2-2    rs1005533   100 24.98883743 93.70753223
W15-P2-3    rs1005533   100 24.98883743 93.70753223
W15-P2-4    rs1005533   100 24.98883743 93.70753223
W21-P2-1    rs1005533   100 24.98883743 93.70753223
W21-P2-2    rs1005533   100 24.98883743 93.70753223
W21-P2-3    rs1005533   99.90044798 24.98883743 93.70753223
W21-P2-4    rs1005533   99.72375691 24.98883743 93.70753223
W21-P2-1    rs1005533   100 24.98883743 93.70753223
W21-P2-2    rs1005533   100 24.98883743 93.70753223
W21-P2-3    rs1005533   100 24.98883743 93.70753223
W21-P2-4    rs1005533   0   24.98883743 93.70753223
W15-P2-1    rs10092491  52.40641711 1.340954343 51.8604281
W15-P2-2    rs10092491  53.69923603 1.340954343 51.8604281
W15-P2-3    rs10092491  52.56689284 1.340954343 51.8604281
W15-P2-4    rs10092491  50.11764706 1.340954343 51.8604281
W15-P2-1    rs10092491  50.30094583 1.340954343 51.8604281
W15-P2-2    rs10092491  50.96277279 1.340954343 51.8604281
W15-P2-3    rs10092491  50.94102886 1.340954343 51.8604281
W15-P2-4    rs10092491  51.2849162  1.340954343 51.8604281
W21-P2-1    rs10092491  53.56976202 1.340954343 51.8604281
W21-P2-2    rs10092491  50.27861123 1.340954343 51.8604281
W21-P2-3    rs10092491  52.8358209  1.340954343 51.8604281
W21-P2-4    rs10092491  51.42585551 1.340954343 51.8604281
W21-P2-1    rs10092491  52.77890467 1.340954343 51.8604281
W21-P2-2    rs10092491  52.89017341 1.340954343 51.8604281
W21-P2-3    rs10092491  53.70786517 1.340954343 51.8604281
W21-P2-4    rs10092491  50  1.340954343 51.8604281

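Because the mean in the last column (AVG.MAF) plus the SD can exceed the maximum possible value of 100, the plot shows error bars that extend past 100 on the y-axis.

Here is the code that creates that plot: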

library(ggplot2)

pe1 = ggplot(half1, aes(x = Target.ID, y = AVG.MAF)) +
  geom_bar(stat = "identity", position = "dodge", colour = "black",
           width = 0.5, fill = "yellowgreen") +
  xlab("") +
  ylab("Average Major Allele Frequency") +
  labs(title = "Allele Balance AmpliSeq Identity Sample P2") +
  geom_errorbar(aes(ymin = AVG.MAF - SD, ymax = AVG.MAF + SD),
                width = 0.4, position = position_dodge(0.9), size = 0.6) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))

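I tried truncating the plot with coord_cartesian, but the result looks as if I'm hiding some of the data.

Here is the code that creates the plot with truncated standard deviation bars: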

pe1 = ggplot(half1, aes(x = Target.ID, y = AVG.MAF)) +
  geom_bar(stat = "identity", position = "dodge", colour = "black",
           width = 0.5, fill = "yellowgreen") +
  xlab("") +
  ylab("Average Major Allele Frequency") +
  labs(title = "Allele Balance AmpliSeq Identity Sample P2") +
  geom_errorbar(aes(ymin = AVG.MAF - SD, ymax = AVG.MAF + SD),
                width = 0.4, position = position_dodge(0.9), size = 0.6) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5)) +
  coord_cartesian(ylim = c(0, 100))

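It seems there must be a way to limit the standard deviation bars to my intended ymax of 100 while still keeping the top horizontal cap visible in the plot. Does anyone know how to do this?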

In addition to the issues people have raised in the comments, here are a few other considerations:

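  1. You don't need a column that repeats the mean on every row of the data. Instead, you can compute and plot the mean within ggplot from the actual data values in Maj.Allele.Freq. (In fact, because the y column repeats the same mean over and over for each Target.ID, you are actually drawing multiple copies of the mean bar, one on top of another.)

    You can also summarise the data outside ggplot (i.e., compute the mean and SD) and then plot from the summarised data frame. That is sometimes necessary in more complicated cases, but here you can do everything within ggplot.

  2. In my opinion, points work better than bars here.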
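For reference, the "summarise outside ggplot" route mentioned in point 1 might look roughly like the sketch below. The data frame half1_demo and its values are made up for illustration (only the column names come from the question), and maf_summary, mean_maf and sd_maf are hypothetical names. The plotting part is skipped if ggplot2 is not installed.

```r
# Toy data shaped like the question's half1; the values are invented.
half1_demo <- data.frame(
  Target.ID = rep(c("rs1005533", "rs10092491"), each = 4),
  Maj.Allele.Freq = c(100, 100, 99.7, 0, 52.4, 53.7, 50.1, 51.3)
)

# Mean and SD of the raw values, one entry per Target.ID.
means <- tapply(half1_demo$Maj.Allele.Freq, half1_demo$Target.ID, mean)
sds   <- tapply(half1_demo$Maj.Allele.Freq, half1_demo$Target.ID, sd)

# One row per Target.ID, so each bar is drawn exactly once.
maf_summary <- data.frame(
  Target.ID = names(means),
  mean_maf  = as.numeric(means),
  sd_maf    = as.numeric(sds)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p_summary <- ggplot(maf_summary, aes(x = Target.ID, y = mean_maf)) +
    geom_col(fill = "yellowgreen", colour = "black", width = 0.5) +
    geom_errorbar(aes(ymin = mean_maf - sd_maf, ymax = mean_maf + sd_maf),
                  width = 0.2) +
    labs(x = "", y = "Average Major Allele Frequency")
  # print(p_summary) draws the plot
}
```

Because maf_summary has a single row per Target.ID, there is no stack of identical bars hidden behind each visible one.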

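The code below gives both a point version and a bar version, and shows how to add either the standard deviation of the data or a 95% confidence interval for the mean. The blue lines show the standard deviation and the red lines show the 95% confidence interval. (Note that mean_sdl draws mean ± mult × SD and mult defaults to 2; passing fun.args = list(mult = 1) gives ± 1 SD.)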

I've used bootstrapped confidence intervals. For classic normal-theory confidence intervals, switch from mean_cl_boot to mean_cl_normal.
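To make the normal-theory option concrete, here is a minimal base-R sketch of the t-based interval that mean_cl_normal is meant to draw. The helper normal_ci is a hypothetical name introduced only for this illustration; the four values are the first four rs10092491 rows shown in the question.

```r
# Hypothetical helper: classic t-based confidence interval of a mean,
# returned in the y / ymin / ymax shape that stat_summary(fun.data = ...) expects.
normal_ci <- function(x, conf_level = 0.95) {
  x <- x[!is.na(x)]
  n <- length(x)
  half_width <- qt((1 + conf_level) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(y = mean(x), ymin = mean(x) - half_width, ymax = mean(x) + half_width)
}

# First four rs10092491 values from the question's data.
maf <- c(52.40641711, 53.69923603, 52.56689284, 50.11764706)
ci <- normal_ci(maf)
ci
```

In the plots themselves it should be enough to replace fun.data = mean_cl_boot with fun.data = mean_cl_normal. Both are ggplot2 wrappers around Hmisc functions, so the Hmisc package needs to be installed.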

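If you want the y-axis to go down to zero, add coord_cartesian(ylim=c(0,150)), or whatever maximum y value you prefer. As discussed in the comments, to avoid a misleading plot the maximum should sit above the top of the highest error bar, whether that bar represents an SD or a CI.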
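Rather than hard-coding 150, the upper limit could be derived from the data so that it always clears the tallest error bar. The sketch below assumes mean ± 1 SD bars and rounds up to the next multiple of 10. The data in half1_demo are made up, and bar_tops and y_upper are hypothetical names. The plotting part is skipped if ggplot2 is not installed, and it uses the fun / fun.min / fun.max arguments available from ggplot2 3.3.0.

```r
# Toy data shaped like the question's half1; the values are invented.
half1_demo <- data.frame(
  Target.ID = rep(c("rs1005533", "rs10092491"), each = 4),
  Maj.Allele.Freq = c(100, 100, 99.7, 0, 52.4, 53.7, 50.1, 51.3)
)

# Top of the mean + 1 SD bar for each Target.ID.
bar_tops <- tapply(half1_demo$Maj.Allele.Freq, half1_demo$Target.ID,
                   function(x) mean(x) + sd(x))

# Round the tallest bar top up to the next multiple of 10.
y_upper <- ceiling(max(bar_tops) / 10) * 10

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p_limits <- ggplot(half1_demo, aes(x = Target.ID, y = Maj.Allele.Freq)) +
    stat_summary(fun = mean,
                 fun.min = function(x) mean(x) - sd(x),
                 fun.max = function(x) mean(x) + sd(x),
                 geom = "errorbar", width = 0.1) +
    stat_summary(fun = mean, geom = "point", size = 3) +
    # Zooms the view without dropping data; lower whiskers below 0 are cut off.
    coord_cartesian(ylim = c(0, y_upper))
  # print(p_limits) draws the plot
}
```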

library(ggplot2)  # mean_sdl and mean_cl_boot also need the Hmisc package installed

# Point version
ggplot(half1, aes(x=Target.ID, y=Maj.Allele.Freq)) +
  # mult=1 gives mean +/- 1 SD (the mean_sdl default is mult=2)
  stat_summary(fun.data=mean_sdl, fun.args=list(mult=1), geom="errorbar", width=0.1, colour="blue") +
  stat_summary(fun.data=mean_sdl, fun.args=list(mult=1), geom="point", colour="blue", size=3) +
  stat_summary(fun.data = mean_cl_boot, colour="red", geom="errorbar", width=0.1) +
  stat_summary(fun.data = mean_cl_boot, colour="red", geom="point") +
  labs(x="", y="Average Major Allele Frequency", 
       title="Allele Balance AmpliSeq\nIdentity Sample P2") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))

# Bar version
# fun replaces the deprecated fun.y (ggplot2 >= 3.3.0);
# linewidth replaces size for lines (ggplot2 >= 3.4.0)
ggplot(half1, aes(x=Target.ID, y=Maj.Allele.Freq)) +
  stat_summary(fun=mean, geom="bar", fill="yellowgreen", colour="black") +
  # mult=1 gives mean +/- 1 SD (the mean_sdl default is mult=2)
  stat_summary(fun.data=mean_sdl, fun.args=list(mult=1), geom="errorbar", width=0.1, linewidth=1, colour="blue") +
  stat_summary(fun.data = mean_cl_boot, colour="red", geom="errorbar", width=0.1, linewidth=0.7) +
  labs(x="", y="Average Major Allele Frequency", 
       title="Allele Balance AmpliSeq\nIdentity Sample P2") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))

You can also show the SD and the 95% CI side by side on the same plot, with a legend:

# Nudge the two summaries left and right so they don't overlap
pnp = position_nudge(x=0.1)
pnm = position_nudge(x=-0.1)

ggplot(half1, aes(x=Target.ID, y=Maj.Allele.Freq)) +
  # mult=1 gives mean +/- 1 SD (the mean_sdl default is mult=2)
  stat_summary(fun.data=mean_sdl, fun.args=list(mult=1), geom="errorbar", width=0.1,
               position=pnp, aes(colour="SD")) +
  stat_summary(fun.data=mean_sdl, fun.args=list(mult=1), geom="point",
               position=pnp, aes(colour="SD")) +
  stat_summary(fun.data = mean_cl_boot, geom="errorbar", width=0.1, 
               position=pnm, aes(colour="95% CI")) +
  stat_summary(fun.data = mean_cl_boot, geom="point", position=pnm, aes(colour="95% CI")) +
  labs(x="", y="Average Major Allele Frequency", colour="",
       title="Allele Balance AmpliSeq\nIdentity Sample P2") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))