How to calculate column-wise difference between group medians in R
I have a data.frame df1:
set.seed(12345)
df1 <- data.frame(group=c(rep("apple", 4), rep("pear",6)), a=rnorm(10,0,0.4),
b=rnorm(10,0,0.2),
c=rnorm(10,0,0.7), d=rnorm(10,0,0.9), e=rnorm(10,0,0.5))
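How can I calculate, column by column, the difference between the medians of the apple group (rows 1:4) and the pear group (rows 5:10), and append that difference as a new row at the bottom, resulting in df2: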
df2 <- data.frame(group=c(rep("apple", 4), rep("pear",6), "median.dif"),
a=c(rnorm(10,0,0.4), 0.20731387), b=c(rnorm(10,0,0.2), 0.09236982),
c=c(rnorm(10,0,0.7), -0.11008165), d=c(rnorm(10,0,0.9), 1.530703685),
e=c(rnorm(10,0,0.5), -0.31842895))
> df2
group a b c d e
1 apple 0.23421153 -0.02324956 0.5457353 0.73068586 0.5642554
2 apple 0.28378641 0.36346241 1.0190496 1.97715019 -1.190179
3 apple -0.04372133 0.07412557 -0.4510299 1.8442713 -0.5301328
4 apple -0.18139887 0.10404329 -1.0871962 1.46920108 0.4685703
5 pear 0.24235498 -0.1501064 -1.1183967 0.22884407 0.4272259
6 pear -0.72718239 0.16337997 1.2635683 0.44206945 0.7303647
7 pear 0.25203942 -0.1772715 -0.3371532 -0.29167792 -0.7065494
8 pear -0.11047364 -0.06631552 0.4342659 -1.49584522 0.2837016
9 pear -0.1136639 0.22414253 0.4284864 1.59096047 0.2915938
10 pear -0.3677288 0.05974474 -0.1136177 0.02322094 -0.6533994
11 median.dif 0.20731387 0.09236982 -0.1100817 1.5307037 -0.3184289
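We can do this with dplyr. Use summarise to compute, for each column, the difference between the two group medians, then bind_rows the result to the original data frame.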
library(dplyr)
df1 %>% summarise(across(!group, ~median(.[group=='apple'])-
median(.[group=='pear']))) %>%
bind_rows(df1, .)%>%
mutate(group=replace(group, nrow(.), 'median.dif'))
group a b c d e
1 apple 0.23421153 -0.02324956 0.5457353 0.73068586 0.5642554
2 apple 0.28378641 0.36346241 1.0190496 1.97715019 -1.1901790
3 apple -0.04372133 0.07412557 -0.4510299 1.84427130 -0.5301328
4 apple -0.18139887 0.10404329 -1.0871962 1.46920108 0.4685703
5 pear 0.24235498 -0.15010640 -1.1183967 0.22884407 0.4272259
6 pear -0.72718239 0.16337997 1.2635683 0.44206945 0.7303647
7 pear 0.25203942 -0.17727150 -0.3371532 -0.29167792 -0.7065494
8 pear -0.11047364 -0.06631552 0.4342659 -1.49584522 0.2837016
9 pear -0.11366390 0.22414253 0.4284864 1.59096047 0.2915938
10 pear -0.36772880 0.05974474 -0.1136177 0.02322094 -0.6533994
11 median.dif 0.20731387 0.09236982 -0.1100817 1.53070368 -0.3184290
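The expression inside across() is just a per-column difference of two subset medians, so a single value can be checked by hand in base R. A minimal sketch on made-up data: the helper name median_dif and the data frame toy are invented for illustration and are not part of the original answer.

```r
# Hypothetical helper (not from the original answers): median of the `first`
# group minus median of the `second` group for one numeric vector.
median_dif <- function(x, g, first = "apple", second = "pear") {
  median(x[g == first], na.rm = TRUE) - median(x[g == second], na.rm = TRUE)
}

# Small made-up data so the arithmetic is easy to follow
toy <- data.frame(group = c("apple", "apple", "pear", "pear", "pear"),
                  a = c(1, 3, 10, 20, 30),
                  b = c(5, 7, 1, 2, 3))

median_dif(toy$a, toy$group)  # median(1, 3) - median(10, 20, 30) = 2 - 20 = -18
median_dif(toy$b, toy$group)  # median(5, 7) - median(1, 2, 3)    = 6 - 2  = 4
```

Calling median_dif(df1$a, df1$group) on the question's data should reproduce the a entry of the median.dif row, assuming df1 was built with the same seed.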
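Here is a base R way.
1. tapply applies the function median to each column x (see point 3 below), grouped by the column group;
2. there are two groups in each column, so diff computes the difference between the two medians. tapply orders the groups alphabetically, so the result is pear minus apple;
3. points 1 and 2 above are the inner loop. sapply loops the anonymous function \(x) over each column of df1[-1]. This uses the new lambda syntax \(x) introduced in R 4.1.0. For older versions of R, use function(x).
The code is then: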
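# tapply orders the groups alphabetically (apple, pear), so diff() returns
# pear - apple; negate it to get apple - pear, as in the expected df2
new <- sapply(df1[-1], \(x) -diff(tapply(x, df1$group, median, na.rm = TRUE)))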
# sapply names the results a.pear, b.pear, ...; strip the group suffix
names(new) <- sub("\\..*$", "", names(new))
new <- cbind(data.frame(group = "median.dif"), t(new))
df1 <- rbind(df1, new)
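If the direction of the difference matters, one option is to index the tapply result by group name instead of relying on the order that diff sees. A minimal sketch on made-up data: toy and its values are invented for illustration, and vapply is swapped in for sapply only to pin the result type.

```r
# Small made-up data (not the question's df1)
toy <- data.frame(group = c("apple", "apple", "pear", "pear", "pear"),
                  a = c(1, 3, 10, 20, 30),
                  b = c(5, 7, 1, 2, 3))

# tapply returns the medians in alphabetical group order: apple, pear
meds <- tapply(toy$a, toy$group, median)
diff(meds)  # second minus first, i.e. pear - apple = 20 - 2 = 18

# Make the direction explicit by picking the groups by name
dif <- vapply(toy[-1], function(x) {
  m <- tapply(x, toy$group, median, na.rm = TRUE)
  m[["apple"]] - m[["pear"]]
}, numeric(1))

# Append the differences as a new bottom row
out <- rbind(toy, data.frame(group = "median.dif", t(dif)))
out
```

Because the groups are selected by name, the result does not change if the grouping column is later turned into a factor with a different level order.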