Calculating the Medians and Means of Rows (in R)

Question

I am working with the R programming language. Suppose I have the following data ("my_data"):

   student first_run second_run third_run fourth_run fifth_run sixth_run seventh_run eight_run ninth_run tenth_run
1   student1  19.70847   21.79771  16.49083   19.51691  13.97987  14.60733    13.89703  15.24651  20.75679  18.44020
2   student2  11.22369   15.36253  16.90215   20.20724  15.90227  15.14539    13.74945  18.30090  19.55124  17.24132
3   student3  15.93649   17.03599  14.20214   13.17548  14.70327  15.49697    13.08945  19.94142  22.41674  17.37958
4   student4  16.18733   15.13197  14.79481   16.75177  14.51287  17.71816    13.45054  14.25553  19.89091  18.88981
5   student5  18.71084   18.85453  17.15864   19.38880  15.68862  18.39169    15.26428  16.04526  18.92532  16.62409
6   student6  19.75246   12.74605  18.52214   17.92626  14.48501  17.20780    13.10512  12.46502  20.68583  15.87711
7   student7  14.75144   23.82376  18.51366   20.77424  14.22155  16.08186    12.95981  12.67820  20.12166  15.66006
8   student8  17.06516   15.63075  13.72026   15.02068  14.21098  15.99414    14.64818  16.15603  21.74607  17.07382
9   student9  20.27611   12.44592  12.26502   15.13456  14.61552  18.72192    15.11129  17.60746  18.83831  17.55257
10 student10  17.70736   16.21620  14.10861   17.20014  16.59376  19.50027    13.05073  15.80002  18.09781  18.34313

I would like to add 2 columns to this data:

my_mean: the mean of each row
my_median: the median of each row

I tried the following code in R:

my_data$median = apply(my_data, 1, median, na.rm=T)

my_data$mean = apply(my_data, 1, mean, na.rm=T)

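But I do not think this code is correct. For example, with this code the median of the second row is returned as "16.90215".

But when I take the median of this row manually: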

median(11.22369  , 15.36253 , 16.90215 ,  20.20724,  15.90227 , 15.14539   , 13.74945 , 18.30090 , 19.55124 , 17.24132)

I get an answer of 11.22.

Can someone tell me what I am doing wrong?

Thanks

Answer 1

library(dplyr)

df %>% 
  rowwise() %>% 
  mutate(median = median(c_across(where(is.numeric))),
         mean = mean(c_across(where(is.numeric))))

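c_across and rowwise were created for exactly this kind of situation. Most dplyr verbs work column-wise; to change that behaviour, pipe to rowwise first.

c_across then combines all values in a row that are numeric (hence where(is.numeric)) into a single numeric vector, to which mean or median can be applied.

Note: you may want to pipe the output to ungroup, because rowwise creates a row-wise grouped data frame.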
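As a minimal sketch of the note about ungroup, assuming dplyr 1.0 or later is installed: the data frame toy and its columns run1 to run3 are made up for illustration and are not the OP's data. The sketch selects the run columns by name rather than with where(is.numeric), so a column created earlier in the same mutate() cannot be swept into a later selection.

```r
library(dplyr)

# Hypothetical toy data, not the OP's data frame
toy <- data.frame(student = c("a", "b"),
                  run1 = c(1, 4),
                  run2 = c(3, 8),
                  run3 = c(8, 9))

result <- toy %>%
  rowwise() %>%
  mutate(median = median(c_across(run1:run3)),
         mean   = mean(c_across(run1:run3))) %>%
  ungroup()   # drop the row-wise grouping so later verbs work column-wise again

print(result)
```

After ungroup(), result is an ordinary tibble again, so a following summarise() or mutate() treats each column as a whole instead of one row at a time.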

Answer 2

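The manual calculation is not correct: the first argument of median is 'x', which should be a vector. The second argument is na.rm, followed by the variadic argument .... So when you write median(11.22369, 15.36253, ...), only 11.22369 is taken as 'x' (15.36253 is matched to na.rm and the remaining values fall into ...), and that single value is what gets returned. Instead, pass one vector built with c: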

median(c(11.22369  , 15.36253 , 16.90215 ,  20.20724,  15.90227 , 15.14539   , 13.74945 , 18.30090 , 19.55124 , 17.24132))
[1] 16.40221
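The argument matching described above can be seen with a small base R sketch. The numbers 3, 1, 2 are arbitrary illustration values, not taken from the OP's data.

```r
# Separate arguments: x = 3, na.rm = 1, and 2 falls into `...`
wrong <- median(3, 1, 2)

# One vector argument: x = c(3, 1, 2)
right <- median(c(3, 1, 2))

print(wrong)  # 3, the median of the single value passed as x
print(right)  # 2, the median of all three values
```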

Also, given the OP's data, the first column, which is character or factor, should be dropped:

 apply(my_data[-1], 1, median, na.rm=TRUE)
       1        2        3        4        5        6        7        8        9       10 
17.46551 16.40221 15.71673 15.65965 17.77517 16.54246 15.87096 15.81245 16.34356 16.89695

The value for row 2 matches the manual calculation above.
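Why the first column matters: apply works on a matrix, so it converts the data frame with as.matrix first, and a matrix can hold only one type. The base R sketch below illustrates this on a made-up data frame (toy, run1 and run2 are hypothetical names, not the OP's data).

```r
# Hypothetical toy data: one character column, two numeric columns
toy <- data.frame(student = c("a", "b"),
                  run1 = c(1, 4),
                  run2 = c(3, 8))

# With the character column included, every cell becomes a string
full_type <- typeof(as.matrix(toy))

# Without the first column the matrix stays numeric
numeric_type <- typeof(as.matrix(toy[-1]))

# Row medians computed on the numeric columns only
row_medians <- apply(toy[-1], 1, median)

print(full_type)     # "character"
print(numeric_type)  # "double"
print(row_medians)   # 2 and 6
```

This is why the OP's apply(my_data, 1, median) returned a quoted string: the median was taken over character values, including the student name.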

Answer 3

Here is an alternative using pmap, which passes all arguments at once, hence the ellipsis, i.e. .... The output needs to be unnested with unnest_wider from tidyr:

library(tidyr)
library(dplyr)
library(purrr)
df %>% 
  mutate(res = pmap(across(where(is.numeric)),
                    ~ list(median = median(c(...)),
                           avg = mean(c(...))))) %>%
  unnest_wider(res)

Output:

  student   first_run second_run third_run fourth_run fifth_run sixth_run seventh_run eight_run ninth_run tenth_run median   avg
   <chr>         <dbl>      <dbl>     <dbl>      <dbl>     <dbl>     <dbl>       <dbl>     <dbl>     <dbl>     <dbl>  <dbl> <dbl>
 1 student1       19.7       21.8      16.5       19.5      14.0      14.6        13.9      15.2      20.8      18.4   17.5  17.4
 2 student2       11.2       15.4      16.9       20.2      15.9      15.1        13.7      18.3      19.6      17.2   16.4  16.4
 3 student3       15.9       17.0      14.2       13.2      14.7      15.5        13.1      19.9      22.4      17.4   15.7  16.3
 4 student4       16.2       15.1      14.8       16.8      14.5      17.7        13.5      14.3      19.9      18.9   15.7  16.2
 5 student5       18.7       18.9      17.2       19.4      15.7      18.4        15.3      16.0      18.9      16.6   17.8  17.5
 6 student6       19.8       12.7      18.5       17.9      14.5      17.2        13.1      12.5      20.7      15.9   16.5  16.3
 7 student7       14.8       23.8      18.5       20.8      14.2      16.1        13.0      12.7      20.1      15.7   15.9  17.0
 8 student8       17.1       15.6      13.7       15.0      14.2      16.0        14.6      16.2      21.7      17.1   15.8  16.1
 9 student9       20.3       12.4      12.3       15.1      14.6      18.7        15.1      17.6      18.8      17.6   16.3  16.3
10 student10      17.7       16.2      14.1       17.2      16.6      19.5        13.1      15.8      18.1      18.3   16.9  16.7

Answer 4

You can definitely benefit from the speed of the matrixStats library.

matrixStats::rowMedians(as.matrix(d[-1]))
# [1] 17.46551 16.40221 15.71673 15.65965 17.77517 16.54246 15.87096 15.81245 16.34356 16.89695
matrixStats::rowMeans2(as.matrix(d[-1]))
# [1] 17.44417 16.35862 16.33775 16.15837 17.50521 16.27728 16.95862 16.12661 16.25687 16.66180

stopifnot(all.equal(matrixStats::rowMedians(as.matrix(d[-1])),
                    as.numeric(apply(d[-1], 1, median, na.rm=T))))
stopifnot(all.equal(matrixStats::rowMeans2(as.matrix(d[-1])),
                    as.numeric(apply(d[-1], 1, mean, na.rm=T))))
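If installing a package is not an option, the two columns asked for in the question can also be added with base R alone. This is a minimal sketch on a made-up data frame (toy and run1 to run3 are hypothetical names); the same pattern should carry over to the OP's data by replacing toy with my_data.

```r
# Hypothetical toy data standing in for the OP's data frame
toy <- data.frame(student = c("a", "b"),
                  run1 = c(1, 4),
                  run2 = c(3, 8),
                  run3 = c(8, 9))

# Extract the numeric columns once, before any new column is added,
# so the new columns never enter the calculation
runs <- as.matrix(toy[-1])

toy$my_mean   <- rowMeans(runs, na.rm = TRUE)
toy$my_median <- apply(runs, 1, median, na.rm = TRUE)

print(toy)
```

Taking the matrix first matters: computing my_median from toy[-1] after my_mean has been added would include the mean as an extra value in every row.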

Data:

d <- structure(list(student = c("student1", "student2", "student3", 
"student4", "student5", "student6", "student7", "student8", "student9", 
"student10"), first_run = c(19.70847, 11.22369, 15.93649, 16.18733, 
18.71084, 19.75246, 14.75144, 17.06516, 20.27611, 17.70736), 
    second_run = c(21.79771, 15.36253, 17.03599, 15.13197, 18.85453, 
    12.74605, 23.82376, 15.63075, 12.44592, 16.2162), third_run = c(16.49083, 
    16.90215, 14.20214, 14.79481, 17.15864, 18.52214, 18.51366, 
    13.72026, 12.26502, 14.10861), fourth_run = c(19.51691, 20.20724, 
    13.17548, 16.75177, 19.3888, 17.92626, 20.77424, 15.02068, 
    15.13456, 17.20014), fifth_run = c(13.97987, 15.90227, 14.70327, 
    14.51287, 15.68862, 14.48501, 14.22155, 14.21098, 14.61552, 
    16.59376), sixth_run = c(14.60733, 15.14539, 15.49697, 17.71816, 
    18.39169, 17.2078, 16.08186, 15.99414, 18.72192, 19.50027
    ), seventh_run = c(13.89703, 13.74945, 13.08945, 13.45054, 
    15.26428, 13.10512, 12.95981, 14.64818, 15.11129, 13.05073
    ), eight_run = c(15.24651, 18.3009, 19.94142, 14.25553, 16.04526, 
    12.46502, 12.6782, 16.15603, 17.60746, 15.80002), ninth_run = c(20.75679, 
    19.55124, 22.41674, 19.89091, 18.92532, 20.68583, 20.12166, 
    21.74607, 18.83831, 18.09781), tenth_run = c(18.4402, 17.24132, 
    17.37958, 18.88981, 16.62409, 15.87711, 15.66006, 17.07382, 
    17.55257, 18.34313)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
