How to correctly use group_by() and summarise() in a For loop in R
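I am trying to compute some summary information to help me check for outliers within the different groups of a dataset. With dplyr::group_by() and dplyr::summarise() I can get the kind of output I want: a data frame containing summary information for each group of a given variable. Like this: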
Sepal.Length_outlier_check <- iris %>%
  dplyr::group_by(Species) %>%
  dplyr::summarise(min = min(Sepal.Length, na.rm = TRUE),
                   max = max(Sepal.Length, na.rm = TRUE),
                   median = median(Sepal.Length, na.rm = TRUE),
                   MAD = mad(Sepal.Length, na.rm = TRUE),
                   MAD_lowlim = median - (3 * MAD),
                   MAD_highlim = median + (3 * MAD),
                   Outliers_low = any(Sepal.Length < MAD_lowlim, na.rm = TRUE),
                   Outliers_high = any(Sepal.Length > MAD_highlim, na.rm = TRUE)
  )
Sepal.Length_outlier_check
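However, I would like to put this into a for loop so that I can produce a similar summary data frame for every variable in the dataset. I am new to loops, but I think it would need to look something like this: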
vars <- list(colnames(iris))
for (i in vars) {
  x <- iris %>%
    dplyr::group_by(Species) %>%
    dplyr::summarise(min = min(i, na.rm = TRUE),
                     max = max(i, na.rm = TRUE),
                     median = median(i, na.rm = TRUE),
                     MAD = mad(i, na.rm = TRUE),
                     MAD_lowlim = median - (3 * MAD),
                     MAD_highlim = median + (3 * MAD),
                     Outliers_low = any(i < MAD_lowlim, na.rm = TRUE),
                     Outliers_high = any(i > MAD_highlim, na.rm = TRUE)
    )
  assign(paste(i, "Outlier_check", sep = "_"), x)
}
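I know this does not work, because inside summarise() the i does not actually refer to any data. I am not sure what I need to do to make it work! I would be very grateful for any help, or for suggestions on how to do all of this more elegantly.
I would rather not use dplyr::summarise_all(), because it outputs a single summary table for all variables, and since the real dataset I am working with has many variables, that table would become too large to read easily.
Thanks.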
This is actually tricky; I had the same doubt myself when I asked a similar question.
Here is one approach:
for (i in colnames(iris)[1:4]) {
  # Copy the current column into a fixed-name helper column
  # (note that this adds artificialcolumn to iris itself)
  iris$artificialcolumn <- iris[, which(colnames(iris) == i)]
  print(i)
  # x is overwritten on every iteration, so after the loop it holds
  # only the summary for the last variable (Petal.Width)
  x <- iris %>%
    dplyr::group_by(Species) %>%
    dplyr::summarise(min = min(artificialcolumn, na.rm = TRUE),
                     max = max(artificialcolumn, na.rm = TRUE),
                     median = median(artificialcolumn, na.rm = TRUE),
                     MAD = mad(artificialcolumn, na.rm = TRUE),
                     MAD_lowlim = median - (3 * MAD),
                     MAD_highlim = median + (3 * MAD),
                     Outliers_low = any(artificialcolumn < MAD_lowlim, na.rm = TRUE),
                     Outliers_high = any(artificialcolumn > MAD_highlim, na.rm = TRUE)
    )
}
x
Result:
> x
# A tibble: 3 x 9
Species min max median MAD MAD_lowlim MAD_highlim Outliers_low Outliers_high
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
1 setosa 0.1 0.6 0.2 0 0.2 0.2 TRUE TRUE
2 versicolor 1 1.8 1.3 0.222 0.633 1.97 FALSE FALSE
3 virginica 1.4 2.5 2 0.297 1.11 2.89 FALSE FALSE
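The fifth column (Species) is a factor, so including it in the loop returns an error; that is why the loop only runs over columns 1 to 4.
Using get(i) solves the main problem.
As for the results, it is better to store them in a list than to have several (in this case 4) unrelated objects in the global environment.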
library(dplyr)

vars <- colnames(iris)
vars <- vars[-which(vars == "Species")]

# Pre-allocate a named list, so that Outlier_check[[i]] fills the
# existing slots instead of appending new ones after four NULL entries
Outlier_check <- setNames(vector("list", length(vars)), vars)
for (i in vars) {
  Outlier_check[[i]] <- iris %>%
    group_by(Species) %>%
    summarise(min = min(get(i), na.rm = TRUE),
              max = max(get(i), na.rm = TRUE),
              median = median(get(i), na.rm = TRUE),
              MAD = mad(get(i), na.rm = TRUE),
              MAD_lowlim = median - (3 * MAD),
              MAD_highlim = median + (3 * MAD),
              Outliers_low = any(get(i) < MAD_lowlim, na.rm = TRUE),
              Outliers_high = any(get(i) > MAD_highlim, na.rm = TRUE)
    )
}

Outlier_check$Sepal.Length
## A tibble: 3 x 9
# Species min max median MAD MAD_lowlim
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 setosa 4.3 5.8 5 0.297 4.11
#2 versic… 4.9 7 5.9 0.519 4.34
#3 virgin… 4.9 7.9 6.5 0.593 4.72
## ... with 3 more variables: MAD_highlim <dbl>,
## Outliers_low <lgl>, Outliers_high <lgl>
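As a hedged variant of the get(i) loop above, the same lookup can be written with the .data pronoun that dplyr exports, which looks a column up by its name inside the data frame rather than searching the surrounding environments as well. This is a sketch rather than the answerer's method; the names outlier_check and v are illustrative, and setdiff() is used here only as a shorter way to drop Species.

```r
library(dplyr)

# All measurement columns, i.e. everything except the grouping column
vars <- setdiff(colnames(iris), "Species")

# Named list with one slot per variable
outlier_check <- setNames(vector("list", length(vars)), vars)

for (v in vars) {
  outlier_check[[v]] <- iris %>%
    group_by(Species) %>%
    summarise(
      # .data[[v]] fetches the column whose name is stored in the string v
      min = min(.data[[v]], na.rm = TRUE),
      max = max(.data[[v]], na.rm = TRUE),
      median = median(.data[[v]], na.rm = TRUE),
      MAD = mad(.data[[v]], na.rm = TRUE),
      MAD_lowlim = median - 3 * MAD,
      MAD_highlim = median + 3 * MAD,
      Outliers_low = any(.data[[v]] < MAD_lowlim, na.rm = TRUE),
      Outliers_high = any(.data[[v]] > MAD_highlim, na.rm = TRUE)
    )
}

outlier_check$Sepal.Length
```

Because .data[[v]] only looks inside the data frame, a variable in the global environment that happens to share a column's name cannot be picked up by accident, which is possible with get().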
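You can also write your own function, which is simpler and more flexible. With the tidy evaluation approach, you use rlang::sym() to convert a string into a symbol and then unquote it inside summarise() with !! (bang-bang).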
library(dplyr)

check_outlier <- function(df, .groupvar, .checkvar) {

  .groupvar <- sym(.groupvar)
  .checkvar <- sym(.checkvar)

  df_outlier_check <- df %>%
    dplyr::group_by(!! .groupvar) %>%
    dplyr::summarise(min = min(!! .checkvar, na.rm = TRUE),
                     max = max(!! .checkvar, na.rm = TRUE),
                     median = median(!! .checkvar, na.rm = TRUE),
                     MAD = mad(!! .checkvar, na.rm = TRUE),
                     MAD_lowlim = median - (3 * MAD),
                     MAD_highlim = median + (3 * MAD),
                     Outliers_low = any(!! .checkvar < MAD_lowlim, na.rm = TRUE),
                     Outliers_high = any(!! .checkvar > MAD_highlim, na.rm = TRUE)
    )

  return(df_outlier_check)
}

# test function
check_outlier(iris, "Species", "Sepal.Length")
#> # A tibble: 3 x 9
#> Species min max median MAD MAD_lowlim MAD_highlim Outliers_low
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
#> 1 setosa 4.3 5.8 5 0.297 4.11 5.89 FALSE
#> 2 versic~ 4.9 7 5.9 0.519 4.34 7.46 FALSE
#> 3 virgin~ 4.9 7.9 6.5 0.593 4.72 8.28 FALSE
#> # ... with 1 more variable: Outliers_high <lgl>
Loop over all variables and combine the results into a single data frame with purrr::map_df():
library(purrr)

vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

vars %>%
  set_names() %>%
  map_df(~ check_outlier(iris, "Species", .x), .id = 'Variable')
#> # A tibble: 12 x 10
#> Variable Species min max median MAD MAD_lowlim MAD_highlim
#> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Sepal.L~ setosa 4.3 5.8 5 0.297 4.11 5.89
#> 2 Sepal.L~ versic~ 4.9 7 5.9 0.519 4.34 7.46
#> 3 Sepal.L~ virgin~ 4.9 7.9 6.5 0.593 4.72 8.28
#> 4 Sepal.W~ setosa 2.3 4.4 3.4 0.371 2.29 4.51
#> 5 Sepal.W~ versic~ 2 3.4 2.8 0.297 1.91 3.69
#> 6 Sepal.W~ virgin~ 2.2 3.8 3 0.297 2.11 3.89
#> 7 Petal.L~ setosa 1 1.9 1.5 0.148 1.06 1.94
#> 8 Petal.L~ versic~ 3 5.1 4.35 0.519 2.79 5.91
#> 9 Petal.L~ virgin~ 4.5 6.9 5.55 0.667 3.55 7.55
#> 10 Petal.W~ setosa 0.1 0.6 0.2 0 0.2 0.2
#> 11 Petal.W~ versic~ 1 1.8 1.3 0.222 0.633 1.97
#> 12 Petal.W~ virgin~ 1.4 2.5 2 0.297 1.11 2.89
#> # ... with 2 more variables: Outliers_low <lgl>, Outliers_high <lgl>
Created on 2018-10-20 by the reprex package (v0.2.1.9000)
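Since the question asks for one summary per variable rather than one large table, the map_df() call above could be swapped for purrr::map(), which keeps each result as a separate element of a named list. The sketch below is an assumption-laden illustration, not part of the original answer: it redefines a trimmed-down check_outlier() so that it runs on its own, writes rlang::sym() explicitly, wraps the unquoted symbol in parentheses to make the comparison unambiguous, and uses the illustrative name outlier_list.

```r
library(dplyr)
library(purrr)

# Trimmed-down version of the function above (min and max omitted for brevity)
check_outlier <- function(df, groupvar, checkvar) {
  groupvar <- rlang::sym(groupvar)
  checkvar <- rlang::sym(checkvar)

  df %>%
    group_by(!!groupvar) %>%
    summarise(
      median = median(!!checkvar, na.rm = TRUE),
      MAD = mad(!!checkvar, na.rm = TRUE),
      MAD_lowlim = median - 3 * MAD,
      MAD_highlim = median + 3 * MAD,
      Outliers_low = any((!!checkvar) < MAD_lowlim, na.rm = TRUE),
      Outliers_high = any((!!checkvar) > MAD_highlim, na.rm = TRUE)
    )
}

vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# map() returns a list; set_names() makes the variable names the list names
outlier_list <- vars %>%
  set_names() %>%
  map(~ check_outlier(iris, "Species", .x))

# Look at one variable at a time
outlier_list$Petal.Width
```

Each element is a small tibble with one row per species, so a dataset with many variables can be inspected one variable at a time.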
You can also create these per-variable, per-species summaries without a loop or a separate function: just gather the non-Species columns, group, and summarise:
library(tidyverse)

iris.summary <- iris %>%
  gather(variable, value, -Species) %>%
  group_by(variable, Species) %>%
  summarize(
    min = min(value, na.rm = TRUE),
    max = max(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    MAD = mad(value, na.rm = TRUE),
    MAD_lowlim = median - (3 * MAD),
    MAD_highlim = median + (3 * MAD),
    Outliers_low = any(value < MAD_lowlim, na.rm = TRUE),
    Outliers_high = any(value > MAD_highlim, na.rm = TRUE)
  )

iris.summary
variable Species min max median MAD MAD_lowlim MAD_highlim Outliers_low Outliers_high
<chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
1 Petal.Length setosa 1 1.9 1.5 0.148 1.06 1.94 TRUE FALSE
2 Petal.Length versicolor 3 5.1 4.35 0.519 2.79 5.91 FALSE FALSE
3 Petal.Length virginica 4.5 6.9 5.55 0.667 3.55 7.55 FALSE FALSE
4 Petal.Width setosa 0.1 0.6 0.2 0 0.2 0.2 TRUE TRUE
5 Petal.Width versicolor 1 1.8 1.3 0.222 0.633 1.97 FALSE FALSE
6 Petal.Width virginica 1.4 2.5 2 0.297 1.11 2.89 FALSE FALSE
7 Sepal.Length setosa 4.3 5.8 5 0.297 4.11 5.89 FALSE FALSE
8 Sepal.Length versicolor 4.9 7 5.9 0.519 4.34 7.46 FALSE FALSE
9 Sepal.Length virginica 4.9 7.9 6.5 0.593 4.72 8.28 FALSE FALSE
10 Sepal.Width setosa 2.3 4.4 3.4 0.371 2.29 4.51 FALSE FALSE
11 Sepal.Width versicolor 2 3.4 2.8 0.297 1.91 3.69 FALSE FALSE
12 Sepal.Width virginica 2.2 3.8 3 0.297 2.11 3.89 FALSE FALSE
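In current tidyr, gather() is superseded by pivot_longer(), so the reshaping step above could also be sketched as follows. This assumes tidyr 1.0.0 or later and dplyr 1.0.0 or later (for the .groups argument); the name iris_summary and the final filter() call are illustrative additions, not part of the original answer.

```r
library(dplyr)
library(tidyr)

iris_summary <- iris %>%
  # Stack the four measurement columns into variable/value pairs
  pivot_longer(-Species, names_to = "variable", values_to = "value") %>%
  group_by(variable, Species) %>%
  summarise(
    min = min(value, na.rm = TRUE),
    max = max(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    MAD = mad(value, na.rm = TRUE),
    MAD_lowlim = median - 3 * MAD,
    MAD_highlim = median + 3 * MAD,
    Outliers_low = any(value < MAD_lowlim, na.rm = TRUE),
    Outliers_high = any(value > MAD_highlim, na.rm = TRUE),
    .groups = "drop"  # return an ungrouped tibble
  )

# Pull out the rows for a single variable when the full table is too long
filter(iris_summary, variable == "Sepal.Length")
```

Filtering on the variable column gives the same per-variable view as the list-based answers while keeping everything in one data frame.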