Writing a function for summary statistics in R

Question

I've run into a problem I can't figure out. Basically, I want to generate the mean, SD, and N per group for many variables. My data looks like this:

dataSet <- data.frame(study_id=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
                      Timepoint=c(1,6,12,18,1,6,12,18,1,6,12,18,1,6,12,18),
                      Secretor=c(0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1),
                      Gene1=runif(16, min=0, max=100),
                      Gene2=runif(16, min=0, max=100),
                      Gene3=runif(16, min=0, max=100),
                      Gene4=runif(16, min=0, max=100))

Then I group it:

library(tidyverse)

grouped_dataSet <- dataSet %>%
  group_by(Secretor, Timepoint)

When I run the following line of code, I get what I want:

summarise(grouped_dataSet, mean = mean(Gene1, na.rm=T), sd = sd(Gene1, na.rm=T), n = n())

Output:

# A tibble: 8 x 5
# Groups:   Secretor [2]
  Secretor Timepoint  mean    sd     n
     <dbl>     <dbl> <dbl> <dbl> <int>
1        0         1  21.8 18.6      2
2        0         6  34.8 33.2      2
3        0        12  43.1  4.34     2
4        0        18  72.6 38.0      2
5        1         1  13.3 15.3      2
6        1         6  41.2 22.8      2
7        1        12  44.9 25.7      2
8        1        18  37.0  8.49     2

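However, when I write that same line as a function (which I intend to map over many columns using the purrr package from the tidyverse), it doesn't work and instead returns NA for everything except the n column: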

summary_function <- function(x) {
  summary <- summarise(grouped_dataSet, mean = mean(x, na.rm=T), sd = sd(x, na.rm=T), n = n())
  return(summary)
}

summary_function("Gene1")

Output:

# A tibble: 8 x 5
# Groups:   Secretor [2]
  Secretor Timepoint  mean    sd     n
     <dbl>     <dbl> <dbl> <dbl> <int>
1        0         1    NA    NA     2
2        0         6    NA    NA     2
3        0        12    NA    NA     2
4        0        18    NA    NA     2
5        1         1    NA    NA     2
6        1         6    NA    NA     2
7        1        12    NA    NA     2
8        1        18    NA    NA     2

This is the warning I get:

In var(if (is.vector(x) || is.factor(x)) x else as.double(x),  ... :
  NAs introduced by coercion

Can anyone advise on why this works as a single line of code but not as a function?

Many thanks.

Answer 1

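We can use ensym so that the argument can be passed either quoted or unquoted; ensym converts it to a symbol, which is then evaluated inside summarise with !!: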

summary_function <- function(x) {
  # Convert the argument (string or bare name) to a symbol
  x <- ensym(x)
  summarise(grouped_dataSet,
            mean = mean(!!x, na.rm = TRUE),
            sd = sd(!!x, na.rm = TRUE),
            n = n())
}

summary_function("Gene1")
# A tibble: 8 x 5
# Groups:   Secretor [2]
#  Secretor Timepoint  mean    sd     n
#     <dbl>     <dbl> <dbl> <dbl> <int>
#1        0         1 69.4   2.25     2
#2        0         6  9.67 13.6      2
#3        0        12 39.5  10.6      2
#4        0        18 17.4  19.2      2
#5        1         1 41.0  54.0      2
#6        1         6 58.5   7.57     2
#7        1        12 75.5   1.42     2
#8        1        18 80.5  24.7      2


summary_function(Gene1)
# A tibble: 8 x 5
# Groups:   Secretor [2]
#  Secretor Timepoint  mean    sd     n
#     <dbl>     <dbl> <dbl> <dbl> <int>
#1        0         1 69.4   2.25     2
#2        0         6  9.67 13.6      2
#3        0        12 39.5  10.6      2
#4        0        18 17.4  19.2      2
#5        1         1 41.0  54.0      2
#6        1         6 58.5   7.57     2
#7        1        12 75.5   1.42     2
#8        1        18 80.5  24.7      2
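
As for why the original version returned NA: inside that function, x is just the string "Gene1", so mean() and sd() receive text rather than the column values. A minimal base-R sketch of that behavior (the variable names here are only for illustration):

```r
# Inside the original function, x holds the string "Gene1", not the column values.
x <- "Gene1"

# mean() on a character vector warns and returns NA; the warning is suppressed
# here only to keep the example quiet.
string_mean <- suppressWarnings(mean(x, na.rm = TRUE))
is.na(string_mean)

# Passing actual numeric values works as expected.
values <- c(10, 20, 30)
numeric_mean <- mean(values, na.rm = TRUE)
```

That is also why only the n column, which does not depend on x, came out right.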

Also, for reusability across different datasets, it is better to have an additional argument that takes the dataset object.
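
A minimal sketch of what that could look like, assuming dplyr >= 1.0.0 (for the .groups argument). The names summarise_column, example_data, and gene_cols are hypothetical and only for illustration; the data is a smaller made-up version of the question's dataset.

```r
library(dplyr)

# Hypothetical variant of summary_function that also takes the (grouped) data.
summarise_column <- function(data, x) {
  x <- rlang::ensym(x)  # accepts a bare name or a string
  summarise(data,
            mean = mean(!!x, na.rm = TRUE),
            sd = sd(!!x, na.rm = TRUE),
            n = n(),
            .groups = "drop_last")  # keep the default "drop last group" behavior
}

# Small illustrative dataset: 4 groups of 2 rows each.
set.seed(1)
example_data <- data.frame(
  Secretor = rep(c(0, 1), each = 4),
  Timepoint = rep(c(1, 6), times = 4),
  Gene1 = runif(8, min = 0, max = 100),
  Gene2 = runif(8, min = 0, max = 100)
)
grouped <- group_by(example_data, Secretor, Timepoint)

res_gene1 <- summarise_column(grouped, Gene1)    # bare name
res_gene2 <- summarise_column(grouped, "Gene2")  # string

# Mapping over several columns: unquote the string with !! so that ensym()
# sees the column name rather than the loop variable `col`.
gene_cols <- c("Gene1", "Gene2")
all_res <- lapply(setNames(gene_cols, gene_cols),
                  function(col) summarise_column(grouped, !!col))
```

The same pattern should carry over to purrr::map in place of lapply, which is what the question set out to do.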

Answer 2

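@akrun's suggestion for how to solve your immediate problem is right.

Another approach is to use tidyr's nesting functionality, by returning a single-element list that contains a data.frame (here a tibble) of results.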

summary_function <- function(x) {
  summary <- list(tibble(mean = mean(x, na.rm=T), sd = sd(x, na.rm=T), n = length(x[!is.na(x)])))
  return(summary)
}

Then you can use across to apply the same function to multiple columns:

dataSet %>%
  group_by(Secretor, Timepoint) %>% 
  summarize(across(Gene1:Gene4, summary_function))
# A tibble: 8 x 6
# Groups:   Secretor [2]
#  Secretor Timepoint Gene1            Gene2            Gene3            Gene4           
#     <dbl>     <dbl> <list>           <list>           <list>           <list>          
#1        0         1 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#2        0         6 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#3        0        12 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#4        0        18 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#5        1         1 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#6        1         6 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#7        1        12 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#8        1        18 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>

Now we can unnest those same columns using unnest with names_sep =:

dataSet %>%
  group_by(Secretor, Timepoint) %>% 
  summarize(across(Gene1:Gene4, summary_function)) %>%
  unnest(Gene1:Gene4, names_sep = "_")
# A tibble: 8 x 14
# Groups:   Secretor [2]
#  Secretor Timepoint Gene1_mean Gene1_sd Gene1_n Gene2_mean Gene2_sd Gene2_n Gene3_mean Gene3_sd Gene3_n
#     <dbl>     <dbl>      <dbl>    <dbl>   <int>      <dbl>    <dbl>   <int>      <dbl>    <dbl>   <int>
#1        0         1      71.2     28.6        2       62.3     27.0       2       28.4    33.3        2
#2        0         6       5.40     7.43       2       58.6     29.1       2       37.0    33.9        2
#3        0        12      91.8     11.4        2       53.9     31.0       2       33.2    46.0        2
#4        0        18      51.5     65.0        2       65.3     40.2       2       63.8    32.7        2
#5        1         1      30.8     18.0        2       50.0     19.9       2       22.8     6.71       2
#6        1         6      63.9     49.2        2       59.9     41.8       2       30.9    39.5        2
#7        1        12      85.3      6.74       2       51.0     41.1       2       28.5    22.9        2
#8        1        18      41.7     44.8        2       80.2     24.0       2       64.7    17.4        2
## … with 3 more variables: Gene4_mean <dbl>, Gene4_sd <dbl>, Gene4_n <int>

This is a recent addition to tidyr and dplyr (version >= 1.0.0), but it can come in handy.
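
If you are unsure whether your installation is recent enough, a quick check along these lines may help. This sketch assumes the 1.0.0 threshold applies to both packages; dplyr_ok and tidyr_ok are illustrative names.

```r
# Compare the installed package versions against the 1.0.0 threshold.
dplyr_ok <- packageVersion("dplyr") >= "1.0.0"
tidyr_ok <- packageVersion("tidyr") >= "1.0.0"

if (!(dplyr_ok && tidyr_ok)) {
  message("Consider updating dplyr and tidyr before using across() with unnest(names_sep = ).")
}
```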
