ggplot2 boxplot displays a different median than the calculated one

Question

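I am plotting a simple boxplot of weight by year for two groups from a large dataset (2,150,000 cases). All groups have the same median except the last group in the last year, but on the boxplot that group is drawn with the same median as all the others.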

 #boxplot
ggplot(dataset, aes(x=Year, y=SUM_MME_mg, fill=GenderPerson)) + 
  geom_boxplot(outlier.shape = NA)+
  ylim(0,850)


#median by group
pivot <- dataset %>%
  select(SUM_MME_mg,GenderPerson,Year )%>%
  group_by(Year, GenderPerson) %>%
  summarise(MedianValues = median(SUM_MME_mg,na.rm=TRUE))

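I can't figure out what I am doing wrong, or which result is more accurate: the one computed by the boxplot or the one from the median function. R returns no errors or warnings.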

 #my data:
> dput(head(dataset[,c(1,7,10)]))
structure(list(GenderPerson = c(2L, 1L, 2L, 2L, 2L, 2L), Year = c("2015", 
"2014", "2013", "2012", "2011", "2015"), SUM_MME_mg = c(416.16, 
131.76, 790.56, 878.4, 878.4, 878.4)), row.names = c(NA, 6L), class = "data.frame")

Answer 1

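The reason for this behavior has to do with how ylim() works. ylim() is a convenience wrapper around scale_y_continuous(limits = ...). If you look into the documentation for the scale_continuous functions, you will see that setting limits does not just zoom in on a region: it also removes all data points outside that region (by default they are converted to NA). This happens before the stat computation, which is why the median is different when you use ylim(). Your calculation outside ggplot() uses the entire dataset, whereas with ylim() the out-of-range data points are removed before the boxplot statistics are computed.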
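To make the effect concrete, here is a minimal base R sketch that mimics the default censoring by hand on the six sample values from the question's dput() output. It is only an illustration of the idea, not ggplot2's internal code, and the variable names (y, censored, and so on) are made up for this example.

```r
# The six SUM_MME_mg values from the question's dput() output
y <- c(416.16, 131.76, 790.56, 878.4, 878.4, 878.4)

# Mimic a scale limit of c(0, 850): values outside the range become NA
# (an assumption-level imitation of the default out-of-bounds handling)
censored <- ifelse(y < 0 | y > 850, NA, y)

median_full     <- median(y)                        # median of all six values
median_censored <- median(censored, na.rm = TRUE)   # median after censoring
n_removed       <- sum(is.na(censored))             # how many values were dropped

print(c(full = median_full, censored = median_censored, removed = n_removed))
```

On these six rows, three values exceed 850, so the median of the remaining values is noticeably lower than the median of the full sample.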

Fortunately, there is a simple fix: use coord_cartesian(ylim = ...) instead of ylim(), because coord_cartesian() only zooms in on the plot without removing any data points. Compare the two:

ggplot(dataset, aes(x=Year, y=SUM_MME_mg, fill=GenderPerson)) + 
  geom_boxplot(outlier.shape = NA) + ylim(0,850)

ggplot(dataset, aes(x=Year, y=SUM_MME_mg, fill=GenderPerson)) + 
  geom_boxplot(outlier.shape = NA) + coord_cartesian(ylim=c(0,850))

A hint at this behavior should also be visible, because the first code block, the one using ylim(), should give you a warning message:

Warning message:
Removed 3 rows containing non-finite values (stat_boxplot).

whereas the second one, using coord_cartesian(ylim = ...), does not.
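If you want to check which median each plot actually uses, one option is to pull the computed boxplot statistics out of the plot with layer_data(). The sketch below assumes ggplot2 is installed and uses a small hypothetical data frame (df, with a single made-up group "a") built from the sample values in the question, so it is a toy check rather than a reproduction of the full dataset.

```r
library(ggplot2)

# Hypothetical toy data: one group, the six sample values from the question
df <- data.frame(
  g = "a",
  y = c(416.16, 131.76, 790.56, 878.4, 878.4, 878.4)
)

p_ylim  <- ggplot(df, aes(x = g, y = y)) + geom_boxplot() + ylim(0, 850)
p_coord <- ggplot(df, aes(x = g, y = y)) + geom_boxplot() +
  coord_cartesian(ylim = c(0, 850))

# layer_data() returns the statistics computed for the layer;
# the "middle" column holds the median drawn in the box.
# suppressWarnings() hides the "Removed ... rows" warning from ylim().
middle_ylim  <- suppressWarnings(layer_data(p_ylim)$middle)
middle_coord <- layer_data(p_coord)$middle

print(c(ylim = middle_ylim, coord_cartesian = middle_coord, direct = median(df$y)))
```

The value from the coord_cartesian() plot should agree with median(df$y), while the value from the ylim() plot is computed only from the points that fall inside the limits.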
