Calculating the Medians and Means of Rows (in R)
I am using the R programming language. Suppose I have the following data ("my_data"):
student first_run second_run third_run fourth_run fifth_run sixth_run seventh_run eight_run ninth_run tenth_run
1 student1 19.70847 21.79771 16.49083 19.51691 13.97987 14.60733 13.89703 15.24651 20.75679 18.44020
2 student2 11.22369 15.36253 16.90215 20.20724 15.90227 15.14539 13.74945 18.30090 19.55124 17.24132
3 student3 15.93649 17.03599 14.20214 13.17548 14.70327 15.49697 13.08945 19.94142 22.41674 17.37958
4 student4 16.18733 15.13197 14.79481 16.75177 14.51287 17.71816 13.45054 14.25553 19.89091 18.88981
5 student5 18.71084 18.85453 17.15864 19.38880 15.68862 18.39169 15.26428 16.04526 18.92532 16.62409
6 student6 19.75246 12.74605 18.52214 17.92626 14.48501 17.20780 13.10512 12.46502 20.68583 15.87711
7 student7 14.75144 23.82376 18.51366 20.77424 14.22155 16.08186 12.95981 12.67820 20.12166 15.66006
8 student8 17.06516 15.63075 13.72026 15.02068 14.21098 15.99414 14.64818 16.15603 21.74607 17.07382
9 student9 20.27611 12.44592 12.26502 15.13456 14.61552 18.72192 15.11129 17.60746 18.83831 17.55257
10 student10 17.70736 16.21620 14.10861 17.20014 16.59376 19.50027 13.05073 15.80002 18.09781 18.34313
I want to add 2 columns to this data:
- my_mean: the mean of each row
- my_median: the median of each row
I tried the following code in R:
my_data$median = apply(my_data, 1, median, na.rm=T)
my_data$mean = apply(my_data, 1, mean, na.rm=T)
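But I don't think this code is correct. For example, with this code the median of the second row is returned as "16.90215".
But when I take the median of this row by hand: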
median(11.22369 , 15.36253 , 16.90215 , 20.20724, 15.90227 , 15.14539 , 13.74945 , 18.30090 , 19.55124 , 17.24132)
I get 11.22 as the answer.
Can someone tell me what I am doing wrong?
Thanks
library(dplyr)
df %>%
rowwise() %>%
mutate(median = median(c_across(where(is.numeric))),
mean = mean(c_across(where(is.numeric))))
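c_across() and rowwise() were created for exactly this situation. Most dplyr verbs work column-wise; to change that behavior, pipe the data into rowwise() first.
c_across() then combines all the selected values in a row (here the numeric ones, hence where(is.numeric)) into a single numeric vector, to which mean or median can be applied.
Note: you may want to pipe the output into ungroup(), because rowwise() creates a data frame that is grouped by row.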
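As a minimal sketch of the note about ungroup(), here is the same pattern on a small made-up data frame. The name toy and its columns id, x, y, z are illustrative assumptions, not part of the question's data. The columns are selected explicitly with x:z rather than where(is.numeric), so the selection does not depend on which numeric columns exist at the time each expression is evaluated.

```r
library(dplyr)

# Hypothetical toy data (not the OP's data): one id column, three numeric columns
toy <- data.frame(id = c("a", "b"),
                  x  = c(1, 4),
                  y  = c(2, 5),
                  z  = c(6, 9))

res <- toy %>%
  rowwise() %>%                                  # treat each row as its own group
  mutate(row_median = median(c_across(x:z)),     # c_across() collects x, y, z into one vector
         row_mean   = mean(c_across(x:z))) %>%
  ungroup()                                      # drop the row-wise grouping again

res
```

Without the final ungroup(), later verbs such as summarise() would keep operating one row at a time.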
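The manual calculation is incorrect. The first argument of median is 'x', which can be a vector; the second argument is na.rm, followed by the variadic argument (...). So when you write median(11.22369, 15.36253, ...), only 11.22369 is taken as 'x', and that is the value returned. Instead, the values should be concatenated into a single vector with c():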
median(c(11.22369 , 15.36253 , 16.90215 , 20.20724, 15.90227 , 15.14539 , 13.74945 , 18.30090 , 19.55124 , 17.24132))
[1] 16.40221
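Also, based on the OP's data, the first column should be dropped, as it is character or factor. With it included, apply() converts every row to character before median is called.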
apply(my_data[-1], 1, median, na.rm=TRUE)
1 2 3 4 5 6 7 8 9 10
17.46551 16.40221 15.71673 15.65965 17.77517 16.54246 15.87096 15.81245 16.34356 16.89695
The second value, 16.40221, matches the manual calculation for the second row.
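To see where the unexpected 16.90215 comes from, the following sketch reproduces the effect on a one-row data frame built from the second row's values. The names vals, toy, bad and good are illustrative assumptions. Because the row then holds 11 strings (the student label plus ten numbers as text), median returns the middle string of the text-sorted row instead of averaging the two middle numbers.

```r
# Values of the second row (student2), copied from the question
vals <- c(11.22369, 15.36253, 16.90215, 20.20724, 15.90227,
          15.14539, 13.74945, 18.30090, 19.55124, 17.24132)

# Hypothetical one-row data frame: a character column followed by ten numeric columns
toy <- data.frame(student = "student2", t(vals))

# One character column is enough to make the whole matrix character
is.character(as.matrix(toy))

bad  <- apply(toy, 1, median)        # median of 11 strings, sorted as text
good <- apply(toy[-1], 1, median)    # median of the 10 numbers only

as.numeric(bad)    # the sixth string in sorted order
as.numeric(good)   # the mean of the fifth and sixth numbers
```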
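Here is an alternative using pmap, which passes all the values of a row as arguments at once, hence the ellipsis (...). The output then needs to be unnested with unnest_wider from tidyr: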
library(tidyr)
library(dplyr)
library(purrr)
df %>%
mutate(res = pmap(across(where(is.numeric)),
~ list(median = median(c(...)),
avg = mean(c(...))))) %>%
unnest_wider(res)
Output:
student first_run second_run third_run fourth_run fifth_run sixth_run seventh_run eight_run ninth_run tenth_run median avg
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 student1 19.7 21.8 16.5 19.5 14.0 14.6 13.9 15.2 20.8 18.4 17.5 17.4
2 student2 11.2 15.4 16.9 20.2 15.9 15.1 13.7 18.3 19.6 17.2 16.4 16.4
3 student3 15.9 17.0 14.2 13.2 14.7 15.5 13.1 19.9 22.4 17.4 15.7 16.3
4 student4 16.2 15.1 14.8 16.8 14.5 17.7 13.5 14.3 19.9 18.9 15.7 16.2
5 student5 18.7 18.9 17.2 19.4 15.7 18.4 15.3 16.0 18.9 16.6 17.8 17.5
6 student6 19.8 12.7 18.5 17.9 14.5 17.2 13.1 12.5 20.7 15.9 16.5 16.3
7 student7 14.8 23.8 18.5 20.8 14.2 16.1 13.0 12.7 20.1 15.7 15.9 17.0
8 student8 17.1 15.6 13.7 15.0 14.2 16.0 14.6 16.2 21.7 17.1 15.8 16.1
9 student9 20.3 12.4 12.3 15.1 14.6 18.7 15.1 17.6 18.8 17.6 16.3 16.3
10 student10 17.7 16.2 14.1 17.2 16.6 19.5 13.1 15.8 18.1 18.3 16.9 16.7
You can definitely benefit from the speed of the matrixStats package.
matrixStats::rowMedians(as.matrix(d[-1]))
# [1] 17.46551 16.40221 15.71673 15.65965 17.77517 16.54246 15.87096 15.81245 16.34356 16.89695
matrixStats::rowMeans2(as.matrix(d[-1]))
# [1] 17.44417 16.35862 16.33775 16.15837 17.50521 16.27728 16.95862 16.12661 16.25687 16.66180
stopifnot(all.equal(matrixStats::rowMedians(as.matrix(d[-1])),
as.numeric(apply(d[-1], 1, median, na.rm=T))))
stopifnot(all.equal(matrixStats::rowMeans2(as.matrix(d[-1])),
as.numeric(apply(d[-1], 1, mean, na.rm=T))))
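If installing a package is not an option, a base R sketch along the same lines might look like the following. It uses a small made-up data frame (toy, with columns run1 to run4, all hypothetical) and adds the two columns the question asks for, my_mean and my_median. The numeric matrix is built once, before any column is added, so neither result can leak into the other calculation.

```r
# Hypothetical toy data with the same layout as my_data: one id column, numeric runs
toy <- data.frame(student = c("s1", "s2", "s3"),
                  run1 = c(10, 4, 7),
                  run2 = c(12, 8, 7),
                  run3 = c(20, 6, 1),
                  run4 = c(14, 2, 9))

# Numeric matrix of the run columns only (the id column is dropped)
num <- as.matrix(toy[-1])

toy$my_mean   <- rowMeans(num, na.rm = TRUE)           # vectorised row means
toy$my_median <- apply(num, 1, median, na.rm = TRUE)   # row-by-row medians

toy
```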
Data:
d <- structure(list(student = c("student1", "student2", "student3",
"student4", "student5", "student6", "student7", "student8", "student9",
"student10"), first_run = c(19.70847, 11.22369, 15.93649, 16.18733,
18.71084, 19.75246, 14.75144, 17.06516, 20.27611, 17.70736),
second_run = c(21.79771, 15.36253, 17.03599, 15.13197, 18.85453,
12.74605, 23.82376, 15.63075, 12.44592, 16.2162), third_run = c(16.49083,
16.90215, 14.20214, 14.79481, 17.15864, 18.52214, 18.51366,
13.72026, 12.26502, 14.10861), fourth_run = c(19.51691, 20.20724,
13.17548, 16.75177, 19.3888, 17.92626, 20.77424, 15.02068,
15.13456, 17.20014), fifth_run = c(13.97987, 15.90227, 14.70327,
14.51287, 15.68862, 14.48501, 14.22155, 14.21098, 14.61552,
16.59376), sixth_run = c(14.60733, 15.14539, 15.49697, 17.71816,
18.39169, 17.2078, 16.08186, 15.99414, 18.72192, 19.50027
), seventh_run = c(13.89703, 13.74945, 13.08945, 13.45054,
15.26428, 13.10512, 12.95981, 14.64818, 15.11129, 13.05073
), eight_run = c(15.24651, 18.3009, 19.94142, 14.25553, 16.04526,
12.46502, 12.6782, 16.15603, 17.60746, 15.80002), ninth_run = c(20.75679,
19.55124, 22.41674, 19.89091, 18.92532, 20.68583, 20.12166,
21.74607, 18.83831, 18.09781), tenth_run = c(18.4402, 17.24132,
17.37958, 18.88981, 16.62409, 15.87711, 15.66006, 17.07382,
17.55257, 18.34313)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))