Calculating the Medians and Means of Rows (in R)

I am working with the R programming language. Suppose I have the following data ("my_data"):

   student first_run second_run third_run fourth_run fifth_run sixth_run seventh_run eight_run ninth_run tenth_run
1   student1  19.70847   21.79771  16.49083   19.51691  13.97987  14.60733    13.89703  15.24651  20.75679  18.44020
2   student2  11.22369   15.36253  16.90215   20.20724  15.90227  15.14539    13.74945  18.30090  19.55124  17.24132
3   student3  15.93649   17.03599  14.20214   13.17548  14.70327  15.49697    13.08945  19.94142  22.41674  17.37958
4   student4  16.18733   15.13197  14.79481   16.75177  14.51287  17.71816    13.45054  14.25553  19.89091  18.88981
5   student5  18.71084   18.85453  17.15864   19.38880  15.68862  18.39169    15.26428  16.04526  18.92532  16.62409
6   student6  19.75246   12.74605  18.52214   17.92626  14.48501  17.20780    13.10512  12.46502  20.68583  15.87711
7   student7  14.75144   23.82376  18.51366   20.77424  14.22155  16.08186    12.95981  12.67820  20.12166  15.66006
8   student8  17.06516   15.63075  13.72026   15.02068  14.21098  15.99414    14.64818  16.15603  21.74607  17.07382
9   student9  20.27611   12.44592  12.26502   15.13456  14.61552  18.72192    15.11129  17.60746  18.83831  17.55257
10 student10  17.70736   16.21620  14.10861   17.20014  16.59376  19.50027    13.05073  15.80002  18.09781  18.34313

I want to add 2 columns to this data: the median and the mean of each row.

I tried the following code in R:

my_data$median = apply(my_data, 1, median, na.rm=T)

my_data$mean = apply(my_data, 1, mean, na.rm=T)

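But I don't think this code is correct. For example, with this code the median of the second row of data is returned as "16.90215".

But when I take the median of this row manually: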

median(11.22369  , 15.36253 , 16.90215 ,  20.20724,  15.90227 , 15.14539   , 13.74945 , 18.30090 , 19.55124 , 17.24132)

I get the answer:

11.22

Can someone please tell me what I am doing wrong?

Thanks

library(dplyr)

df %>% 
  rowwise() %>% 
  mutate(median = median(c_across(where(is.numeric))),
         mean = mean(c_across(where(is.numeric))))

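c_across and rowwise were created for exactly this situation. Most dplyr verbs work column-wise; to change this behavior, pipe into rowwise first.

c_across then combines all the values in a row that are numeric (hence where(is.numeric)) into a single numeric vector, to which mean and median can then be applied.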

Note: you may want to pipe the output into ungroup, because rowwise creates a data frame that is grouped by row.
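
As a rough sketch of the rowwise / ungroup round trip, here is a made-up two-row data frame (toy and its run1..run3 columns are hypothetical, not the question's data). The run columns are selected by name with starts_with so that the median column created first cannot be picked up by the second selection; that where(is.numeric) might also match columns created earlier in the same mutate call is an assumption worth checking on your dplyr version.

```r
library(dplyr)

# Hypothetical toy data, only for illustration.
toy <- data.frame(student = c("a", "b"),
                  run1 = c(1, 4),
                  run2 = c(2, 5),
                  run3 = c(6, 9))

res <- toy %>%
  rowwise() %>%                                   # treat every row as its own group
  mutate(median = median(c_across(starts_with("run"))),
         mean   = mean(c_across(starts_with("run")))) %>%
  ungroup()                                       # drop the row-wise grouping again

print(res)
```

After ungroup, later verbs such as summarise work on the whole table again instead of one row at a time.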

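The manual calculation is incorrect because the first argument of median is 'x', which can be a vector. The second argument is na.rm, followed by the variadic argument .... So when 11.22369, 15.36253, ... is written as separate arguments, 'x' is taken to be 11.22369 alone, and that is the value returned. Instead, the values should be concatenated into one vector with c: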

median(c(11.22369  , 15.36253 , 16.90215 ,  20.20724,  15.90227 , 15.14539   , 13.74945 , 18.30090 , 19.55124 , 17.24132))
[1] 16.40221
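
To make the argument matching visible, here is a small base R sketch. The names scores, wrong and right are only illustrative; the values are the second row of the question's data.

```r
# Values of the second row (student2) from the question's data.
scores <- c(11.22369, 15.36253, 16.90215, 20.20724, 15.90227,
            15.14539, 13.74945, 18.30090, 19.55124, 17.24132)

# Separate arguments: only the first one is matched to `x`; the second is
# matched positionally to `na.rm`, and the rest fall into `...`.
wrong <- median(11.22369, 15.36253, 16.90215)

# One vector argument: all ten values take part in the calculation.
right <- median(scores)

print(wrong)  # just the first value
print(right)  # the median of the whole row
```

The same pitfall applies to mean, whose first argument is also a single vector 'x'.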

Also, given the OP's data, the first column, which is character or factor, should be dropped before calling apply:

 apply(my_data[-1], 1, median, na.rm=TRUE)
       1        2        3        4        5        6        7        8        9       10 
17.46551 16.40221 15.71673 15.65965 17.77517 16.54246 15.87096 15.81245 16.34356 16.89695 

The value for the second row matches the manual calculation above.
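
The reason the first column matters is that apply first converts the data frame to a matrix, and a matrix holds a single type, so one character column turns every value into a string. A minimal sketch with a made-up data frame (toy is hypothetical, not the question's data) shows this, and also selects the numeric columns by type rather than by position:

```r
# Hypothetical toy data, only for illustration.
toy <- data.frame(student = c("a", "b"),
                  run1 = c(1, 4),
                  run2 = c(2, 5),
                  run3 = c(6, 9))

# With a character column present, the matrix apply() works on is character.
is_char <- is.character(as.matrix(toy))

# Names of the numeric columns, wherever they sit in the data frame.
num_cols <- names(toy)[sapply(toy, is.numeric)]

toy$median <- apply(toy[num_cols], 1, median, na.rm = TRUE)
toy$mean   <- rowMeans(toy[num_cols], na.rm = TRUE)

print(is_char)
print(toy)
```

Storing the column names up front keeps the new median column out of the later mean calculation.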

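Here is an alternative using pmap, which passes all the arguments at once, hence the use of the ellipsis, i.e. .... The output then needs to be unnested with unnest_wider from tidyr:
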
library(tidyr)
library(dplyr)
library(purrr)
df %>% 
  mutate(res = pmap(across(where(is.numeric)),
                    ~ list(median = median(c(...)),
                           avg = mean(c(...))))) %>%
  unnest_wider(res)

Output:

  student   first_run second_run third_run fourth_run fifth_run sixth_run seventh_run eight_run ninth_run tenth_run median   avg
   <chr>         <dbl>      <dbl>     <dbl>      <dbl>     <dbl>     <dbl>       <dbl>     <dbl>     <dbl>     <dbl>  <dbl> <dbl>
 1 student1       19.7       21.8      16.5       19.5      14.0      14.6        13.9      15.2      20.8      18.4   17.5  17.4
 2 student2       11.2       15.4      16.9       20.2      15.9      15.1        13.7      18.3      19.6      17.2   16.4  16.4
 3 student3       15.9       17.0      14.2       13.2      14.7      15.5        13.1      19.9      22.4      17.4   15.7  16.3
 4 student4       16.2       15.1      14.8       16.8      14.5      17.7        13.5      14.3      19.9      18.9   15.7  16.2
 5 student5       18.7       18.9      17.2       19.4      15.7      18.4        15.3      16.0      18.9      16.6   17.8  17.5
 6 student6       19.8       12.7      18.5       17.9      14.5      17.2        13.1      12.5      20.7      15.9   16.5  16.3
 7 student7       14.8       23.8      18.5       20.8      14.2      16.1        13.0      12.7      20.1      15.7   15.9  17.0
 8 student8       17.1       15.6      13.7       15.0      14.2      16.0        14.6      16.2      21.7      17.1   15.8  16.1
 9 student9       20.3       12.4      12.3       15.1      14.6      18.7        15.1      17.6      18.8      17.6   16.3  16.3
10 student10      17.7       16.2      14.1       17.2      16.6      19.5        13.1      15.8      18.1      18.3   16.9  16.7
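
To illustrate what the ellipsis receives, here is a reduced sketch using pmap_dbl on a made-up data frame (toy, row_median and row_mean are hypothetical names). It returns plain numeric vectors instead of a list column, so no unnesting is needed in this simplified form.

```r
library(purrr)

# Hypothetical toy data with numeric columns only.
toy <- data.frame(run1 = c(1, 4),
                  run2 = c(2, 5),
                  run3 = c(6, 9))

# pmap_dbl() calls the function once per row; the row's values arrive as
# separate arguments, so c(...) gathers them back into one vector.
row_median <- pmap_dbl(toy, function(...) median(c(...)))
row_mean   <- pmap_dbl(toy, function(...) mean(c(...)))

print(row_median)
print(row_mean)
```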

You can definitely benefit from the speed of the matrixStats package.

matrixStats::rowMedians(as.matrix(d[-1]))
# [1] 17.46551 16.40221 15.71673 15.65965 17.77517 16.54246 15.87096 15.81245 16.34356 16.89695
matrixStats::rowMeans2(as.matrix(d[-1]))
# [1] 17.44417 16.35862 16.33775 16.15837 17.50521 16.27728 16.95862 16.12661 16.25687 16.66180

stopifnot(all.equal(matrixStats::rowMedians(as.matrix(d[-1])),
                    as.numeric(apply(d[-1], 1, median, na.rm=T))))
stopifnot(all.equal(matrixStats::rowMeans2(as.matrix(d[-1])),
                    as.numeric(apply(d[-1], 1, mean, na.rm=T))))
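
Since the question asks for new columns, a short sketch of attaching the matrixStats results to the data frame may help. The toy data frame below is made up, with one missing value to show the na.rm argument.

```r
library(matrixStats)

# Hypothetical toy data with one missing value.
toy <- data.frame(student = c("a", "b"),
                  run1 = c(1, 4),
                  run2 = c(2, NA),
                  run3 = c(6, 9))

# matrixStats works on matrices, so drop the character column first.
m <- as.matrix(toy[-1])

toy$median <- rowMedians(m, na.rm = TRUE)
toy$mean   <- rowMeans2(m, na.rm = TRUE)

print(toy)
```

Converting to a matrix once and reusing it avoids repeating the conversion for each statistic.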

Data:

d <- structure(list(student = c("student1", "student2", "student3", 
"student4", "student5", "student6", "student7", "student8", "student9", 
"student10"), first_run = c(19.70847, 11.22369, 15.93649, 16.18733, 
18.71084, 19.75246, 14.75144, 17.06516, 20.27611, 17.70736), 
    second_run = c(21.79771, 15.36253, 17.03599, 15.13197, 18.85453, 
    12.74605, 23.82376, 15.63075, 12.44592, 16.2162), third_run = c(16.49083, 
    16.90215, 14.20214, 14.79481, 17.15864, 18.52214, 18.51366, 
    13.72026, 12.26502, 14.10861), fourth_run = c(19.51691, 20.20724, 
    13.17548, 16.75177, 19.3888, 17.92626, 20.77424, 15.02068, 
    15.13456, 17.20014), fifth_run = c(13.97987, 15.90227, 14.70327, 
    14.51287, 15.68862, 14.48501, 14.22155, 14.21098, 14.61552, 
    16.59376), sixth_run = c(14.60733, 15.14539, 15.49697, 17.71816, 
    18.39169, 17.2078, 16.08186, 15.99414, 18.72192, 19.50027
    ), seventh_run = c(13.89703, 13.74945, 13.08945, 13.45054, 
    15.26428, 13.10512, 12.95981, 14.64818, 15.11129, 13.05073
    ), eight_run = c(15.24651, 18.3009, 19.94142, 14.25553, 16.04526, 
    12.46502, 12.6782, 16.15603, 17.60746, 15.80002), ninth_run = c(20.75679, 
    19.55124, 22.41674, 19.89091, 18.92532, 20.68583, 20.12166, 
    21.74607, 18.83831, 18.09781), tenth_run = c(18.4402, 17.24132, 
    17.37958, 18.88981, 16.62409, 15.87711, 15.66006, 17.07382, 
    17.55257, 18.34313)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10"))