Writing a function for summary statistics in R
I've run into a problem I can't figure out. Basically, I want to generate the mean, SD, and N per group for many variables. My data looks like this:
dataSet <- data.frame(study_id=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
Timepoint=c(1,6,12,18,1,6,12,18,1,6,12,18,1,6,12,18),
Secretor=c(0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1),
Gene1=runif(16, min=0, max=100),
Gene2=runif(16, min=0, max=100),
Gene3=runif(16, min=0, max=100),
Gene4=runif(16, min=0, max=100))
Then I group it:
library(tidyverse)
grouped_dataSet <- dataSet %>%
group_by(Secretor, Timepoint)
When I run the following line, I get what I want:
summarise(grouped_dataSet, mean = mean(Gene1, na.rm=T), sd = sd(Gene1, na.rm=T), n = n())
Output:
# A tibble: 8 x 5
# Groups: Secretor [2]
Secretor Timepoint mean sd n
<dbl> <dbl> <dbl> <dbl> <int>
1 0 1 21.8 18.6 2
2 0 6 34.8 33.2 2
3 0 12 43.1 4.34 2
4 0 18 72.6 38.0 2
5 1 1 13.3 15.3 2
6 1 6 41.2 22.8 2
7 1 12 44.9 25.7 2
8 1 18 37.0 8.49 2
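However, when I write the same line as a function (I intend to map it over many columns with the purrr package from the tidyverse), it doesn't work: it returns NA for everything except the n column: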
summary_function <- function(x) {
summary <- summarise(grouped_dataSet, mean = mean(x, na.rm=T), sd = sd(x, na.rm=T), n = n())
return(summary)
}
summary_function("Gene1")
Output:
# A tibble: 8 x 5
# Groups: Secretor [2]
Secretor Timepoint mean sd n
<dbl> <dbl> <dbl> <dbl> <int>
1 0 1 NA NA 2
2 0 6 NA NA 2
3 0 12 NA NA 2
4 0 18 NA NA 2
5 1 1 NA NA 2
6 1 6 NA NA 2
7 1 12 NA NA 2
8 1 18 NA NA 2
This is the warning I get:
In var(if (is.vector(x) || is.factor(x)) x else as.double(x), ... :
NAs introduced by coercion
Can anyone advise why this works as a single line of code but not as a function?
Many thanks.
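Inside the original function, x is the string "Gene1", so mean() and sd() are applied to a character value rather than to the column, which gives NA. We can use ensym() so that the argument can be passed either quoted or unquoted, and then evaluate it with the !! operator: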
summary_function <- function(x) {
x <- ensym(x)
summarise(grouped_dataSet,
mean = mean(!! x, na.rm=T), sd = sd(!!x, na.rm=T), n = n())
}
summary_function("Gene1")
# A tibble: 8 x 5
# Groups: Secretor [2]
# Secretor Timepoint mean sd n
# <dbl> <dbl> <dbl> <dbl> <int>
#1 0 1 69.4 2.25 2
#2 0 6 9.67 13.6 2
#3 0 12 39.5 10.6 2
#4 0 18 17.4 19.2 2
#5 1 1 41.0 54.0 2
#6 1 6 58.5 7.57 2
#7 1 12 75.5 1.42 2
#8 1 18 80.5 24.7 2
summary_function(Gene1)
# A tibble: 8 x 5
# Groups: Secretor [2]
# Secretor Timepoint mean sd n
# <dbl> <dbl> <dbl> <dbl> <int>
#1 0 1 69.4 2.25 2
#2 0 6 9.67 13.6 2
#3 0 12 39.5 10.6 2
#4 0 18 17.4 19.2 2
#5 1 1 41.0 54.0 2
#6 1 6 58.5 7.57 2
#7 1 12 75.5 1.42 2
#8 1 18 80.5 24.7 2
Also, for reusability across different datasets, it would be better to add an extra argument that takes the dataset object.
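As a rough sketch of that suggestion (not the answerer's code): the helper below takes the dataset as its first argument and the column name as a string, looked up through dplyr's .data pronoun instead of ensym(). The name summarise_column and the toy data are made up for illustration, and .groups = "drop" assumes dplyr >= 1.0.0.

```r
library(dplyr)

# Hypothetical helper: `data` is any (possibly grouped) data frame,
# `col` is a column name supplied as a string.
summarise_column <- function(data, col) {
  data %>%
    summarise(
      mean = mean(.data[[col]], na.rm = TRUE),
      sd   = sd(.data[[col]], na.rm = TRUE),
      n    = n(),              # row count per group, as in the question
      .groups = "drop"
    )
}

# Small made-up dataset, only to show the call pattern
toy <- data.frame(
  Secretor = c(0, 0, 1, 1),
  Gene1    = c(10, 20, 30, 50),
  Gene2    = c(1, 3, 5, 7)
)

# Apply the helper to several columns and stack the results;
# the list names become the `gene` column.
gene_cols <- c("Gene1", "Gene2")
results <- lapply(
  setNames(gene_cols, gene_cols),
  function(col) summarise_column(group_by(toy, Secretor), col)
)
combined <- bind_rows(results, .id = "gene")
combined
```

purrr::map() could be used in place of lapply() here in the same way; base lapply() is used only to keep the sketch dependent on dplyr alone.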
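@akrun's suggestion is correct for solving your immediate problem.
Another approach is to use the nesting functionality of tidyr, by having the function return a single-element list containing a data.frame (here a tibble) of results.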
summary_function <- function(x) {
summary <- list(tibble(mean = mean(x, na.rm=T), sd = sd(x, na.rm=T), n = length(x[!is.na(x)])))
return(summary)
}
Then you can use across() to apply the same function to multiple columns:
dataSet %>%
group_by(Secretor, Timepoint) %>%
summarize(across(Gene1:Gene4, summary_function))
# A tibble: 8 x 6
# Groups: Secretor [2]
# Secretor Timepoint Gene1 Gene2 Gene3 Gene4
# <dbl> <dbl> <list> <list> <list> <list>
#1 0 1 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#2 0 6 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#3 0 12 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#4 0 18 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#5 1 1 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#6 1 6 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#7 1 12 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#8 1 18 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
Now we can unnest those same columns using unnest() with names_sep =:
dataSet %>%
group_by(Secretor, Timepoint) %>%
summarize(across(Gene1:Gene4, summary_function)) %>%
unnest(Gene1:Gene4, names_sep = "_")
# A tibble: 8 x 14
# Groups: Secretor [2]
# Secretor Timepoint Gene1_mean Gene1_sd Gene1_n Gene2_mean Gene2_sd Gene2_n Gene3_mean Gene3_sd Gene3_n
# <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <int> <dbl> <dbl> <int>
#1 0 1 71.2 28.6 2 62.3 27.0 2 28.4 33.3 2
#2 0 6 5.40 7.43 2 58.6 29.1 2 37.0 33.9 2
#3 0 12 91.8 11.4 2 53.9 31.0 2 33.2 46.0 2
#4 0 18 51.5 65.0 2 65.3 40.2 2 63.8 32.7 2
#5 1 1 30.8 18.0 2 50.0 19.9 2 22.8 6.71 2
#6 1 6 63.9 49.2 2 59.9 41.8 2 30.9 39.5 2
#7 1 12 85.3 6.74 2 51.0 41.1 2 28.5 22.9 2
#8 1 18 41.7 44.8 2 80.2 24.0 2 64.7 17.4 2
## … with 3 more variables: Gene4_mean <dbl>, Gene4_sd <dbl>, Gene4_n <int>
This is a recent addition to tidyr and dplyr (version >= 1.0.0), but it can come in handy.
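For comparison, a flat result with the same kind of column names can probably be reached without nesting, by passing across() a named list of functions. This is only a sketch under the same assumption of dplyr >= 1.0.0; the toy data is invented, and the n column here counts non-missing values, like length(x[!is.na(x)]) above.

```r
library(dplyr)

# Small made-up dataset with one missing value
toy <- data.frame(
  Secretor = c(0, 0, 1, 1),
  Gene1    = c(10, 20, 30, 50),
  Gene2    = c(1, 3, NA, 7)
)

flat <- toy %>%
  group_by(Secretor) %>%
  summarise(
    across(
      Gene1:Gene2,
      # The list names become the suffix of each output column
      list(
        mean = ~ mean(.x, na.rm = TRUE),
        sd   = ~ sd(.x, na.rm = TRUE),
        n    = ~ sum(!is.na(.x))   # number of non-missing values
      )
    ),
    .groups = "drop"
  )

names(flat)
```

With several functions, across() names the output columns by joining the column name and the function name with an underscore, so no unnest() step is needed.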