How to calculate column-wise difference between group medians in R

I have a data.frame df1:

 set.seed(12345)
 df1 <- data.frame(group=c(rep("apple", 4), rep("pear",6)), a=rnorm(10,0,0.4), 
      b=rnorm(10,0,0.2), 
      c=rnorm(10,0,0.7), d=rnorm(10,0,0.9), e=rnorm(10,0,0.5))

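How can I calculate, column by column, the difference between the median of the apple group (rows 1:4) and the median of the pear group (rows 5:10), and append that difference as a new row at the bottom, giving df2: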

 df2 <- data.frame(group=c(rep("apple", 4), rep("pear",6), "median.dif"), 
      a=c(rnorm(10,0,0.4), 0.20731387), b=c(rnorm(10,0,0.2), 0.09236982), 
      c=c(rnorm(10,0,0.7), -0.11008165), d=c(rnorm(10,0,0.9), 1.530703685), 
      e=c(rnorm(10,0,0.5), -0.31842895))

 > df2
         group           a           b          c           d          e
 1       apple  0.23421153 -0.02324956  0.5457353  0.73068586  0.5642554
 2       apple  0.28378641  0.36346241  1.0190496  1.97715019  -1.190179
 3       apple -0.04372133  0.07412557 -0.4510299   1.8442713 -0.5301328
 4       apple -0.18139887  0.10404329 -1.0871962  1.46920108  0.4685703
 5        pear  0.24235498  -0.1501064 -1.1183967  0.22884407  0.4272259
 6        pear -0.72718239  0.16337997  1.2635683  0.44206945  0.7303647
 7        pear  0.25203942  -0.1772715 -0.3371532 -0.29167792 -0.7065494
 8        pear -0.11047364 -0.06631552  0.4342659 -1.49584522  0.2837016
 9        pear  -0.1136639  0.22414253  0.4284864  1.59096047  0.2915938
 10       pear  -0.3677288  0.05974474 -0.1136177  0.02322094 -0.6533994
 11 median.dif  0.20731387  0.09236982 -0.1100817   1.5307037 -0.3184289

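We can do this with dplyr. Use summarise with across to compute, for every column except group, the apple median minus the pear median, then bind_rows that one-row result onto the original data frame and label the new row.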

library(dplyr)

df1 %>% summarise(across(!group, ~median(.[group=='apple'])-
                                 median(.[group=='pear']))) %>%
        bind_rows(df1, .)%>%
        mutate(group=replace(group, nrow(.), 'median.dif'))

        group           a           b          c           d          e
1       apple  0.23421153 -0.02324956  0.5457353  0.73068586  0.5642554
2       apple  0.28378641  0.36346241  1.0190496  1.97715019 -1.1901790
3       apple -0.04372133  0.07412557 -0.4510299  1.84427130 -0.5301328
4       apple -0.18139887  0.10404329 -1.0871962  1.46920108  0.4685703
5        pear  0.24235498 -0.15010640 -1.1183967  0.22884407  0.4272259
6        pear -0.72718239  0.16337997  1.2635683  0.44206945  0.7303647
7        pear  0.25203942 -0.17727150 -0.3371532 -0.29167792 -0.7065494
8        pear -0.11047364 -0.06631552  0.4342659 -1.49584522  0.2837016
9        pear -0.11366390  0.22414253  0.4284864  1.59096047  0.2915938
10       pear -0.36772880  0.05974474 -0.1136177  0.02322094 -0.6533994
11 median.dif  0.20731387  0.09236982 -0.1100817  1.53070368 -0.3184290
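
As a quick sanity check on the last row, the value for column a can be reproduced by hand in base R. This is only a sketch: the vectors apple_a and pear_a are illustrative names, and their values are copied from the printed output above rather than regenerated with rnorm.

```r
# Column a, copied from the printed data frame above
apple_a <- c(0.23421153, 0.28378641, -0.04372133, -0.18139887)
pear_a  <- c(0.24235498, -0.72718239, 0.25203942,
             -0.11047364, -0.11366390, -0.36772880)

# The same expression the lambda evaluates for each column: apple minus pear
dif_a <- median(apple_a) - median(pear_a)
print(dif_a, digits = 8)
```

The result agrees with the 0.20731387 shown for a in row 11, to the printed precision.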

Here is a base R way.

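  1. tapply applies the function median to each column x (see point 3 below), grouped by the column group;
  2. there are two groups in each column, so diff computes the difference between the two medians (second minus first, which here is pear minus apple, because tapply orders the groups alphabetically);
  3. points 1 and 2 above form the inner step. sapply applies it, as the anonymous function \(x), to every column of df1[-1]. This uses the lambda shorthand \(x) introduced in R 4.1.0; on older versions, write function(x) instead.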

The code is then as follows:

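# For each numeric column: median per group, then diff() of the two medians
new <- sapply(df1[-1], \(x)diff(tapply(x, df1$group, median, na.rm = TRUE)))
# The results carry a group suffix (names such as "a.pear"); strip from the first dot on
names(new) <- sub("\\..*$", "", names(new))
# Turn the named vector into a one-row data frame with a group label
new <- cbind(data.frame(group = "median.diff"), t(new))
df1 <- rbind(df1, new)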
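
Because diff subtracts the first group's median from the second's, the row produced this way is pear minus apple, the opposite sign of the df2 asked for in the question. One possible variant, sketched here on a small hypothetical data frame (toy, med, dif and toy2 are illustrative names, not part of the original answer), indexes the two groups by name so the direction of the subtraction is explicit:

```r
# Small hypothetical data frame with medians that are easy to check by eye
toy <- data.frame(
  group = c("apple", "apple", "apple", "pear", "pear", "pear"),
  a = c(1, 2, 3, 10, 20, 30),
  b = c(5, 7, 9, 1, 2, 3)
)

# Median of every numeric column within each group;
# the result is a matrix with one row per group and one column per variable
med <- sapply(toy[-1], function(x) tapply(x, toy$group, median, na.rm = TRUE))

# Explicit apple minus pear, independent of how the groups are ordered
dif <- med["apple", ] - med["pear", ]

# Build a one-row data frame and append it without overwriting the input
toy2 <- rbind(toy, data.frame(group = "median.dif", as.list(dif)))
toy2
```

Here the medians are 2 and 7 for apple and 20 and 2 for pear, so the appended row holds -18 for a and 5 for b. Applying the same steps to df1 in place of toy should give a row with the signs shown in the question.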