How to get stats of columns stratified by values of a categorical variable in R
I want to get the mean() and sd() of the different columns of the iris dataset, grouped by the values in the Species column:
> head(iris[order(runif(nrow(iris))), ])
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
50           5.0         3.3          1.4         0.2     setosa
111          6.5         3.2          5.1         2.0  virginica
69           6.2         2.2          4.5         1.5 versicolor
150          5.9         3.0          5.1         1.8  virginica
Without distinguishing between the 3 species, apply does the job:
> stats = apply(iris[ ,1:4], MARGIN = 2, function(x) rbind(mean(x), SD = sd(x))); row.names(stats) = c("mean", "sd"); stats
     Sepal.Length Sepal.Width Petal.Length Petal.Width
mean    5.8433333   3.0573333     3.758000   1.1993333
sd      0.8280661   0.4358663     1.765298   0.7622377
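But how can I get a list (?) containing these results broken down by species?
You can use the split function to split the data by species, which gives a list of data frames, and then apply the same summary to each one: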
iris2 <- split(iris, iris$Species)
fun <- function(df){
  stats = apply(df[ ,1:4], MARGIN = 2, function(x) rbind(mean(x), SD = sd(x)))
  row.names(stats) = c("mean", "sd")
  return(stats)
}
lapply(iris2, fun)
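As a rough sketch of how the resulting list can be used, the variant below builds the same kind of per-species summary with sapply and then pulls out individual entries. The names by_species and stats_list are made up for this illustration and are not part of the answer above.

```r
# Split the four numeric columns by species, then summarise each piece.
by_species <- split(iris[, 1:4], iris$Species)

stats_list <- lapply(by_species, function(df) {
  # sapply over the columns returns a 2 x 4 matrix with rows "mean" and "sd"
  sapply(df, function(x) c(mean = mean(x), sd = sd(x)))
})

names(stats_list)                                  # one element per species
stats_list$setosa                                  # the 2 x 4 matrix for setosa
stats_list[["virginica"]]["mean", "Petal.Length"]  # a single statistic
```

Because each list element is a named matrix, a single value can be looked up by species, statistic, and column name rather than by position.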
aggregate is the function you are looking for:
> aggregate(. ~ Species, data = iris, FUN = mean)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026
> aggregate(. ~ Species, data = iris, FUN = sd)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa    0.3524897   0.3790644    0.1736640   0.1053856
2 versicolor    0.5161711   0.3137983    0.4699110   0.1977527
3  virginica    0.6358796   0.3224966    0.5518947   0.2746501
aggregate computes a function over a data set, grouped by one factor or a combination of factors.
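Both statistics can also be obtained in a single aggregate call by passing a function that returns a named vector. The sketch below is one way to do that; the names agg and flat are hypothetical. Note that aggregate then stores each measurement as a matrix column, so the result is flattened with do.call(data.frame, ...) to get ordinary columns.

```r
# One aggregate() call with a function that returns a named vector.
agg <- aggregate(. ~ Species, data = iris,
                 FUN = function(x) c(mean = mean(x), sd = sd(x)))

# Each measurement column of `agg` is a 3 x 2 matrix; expand them into
# plain columns such as Sepal.Length.mean and Sepal.Length.sd.
flat <- do.call(data.frame, agg)
names(flat)
```

The flattened data frame has one row per species and one column per variable and statistic.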
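This is not a complete answer (it does not return a list and does not keep the same table structure), but it is useful to know about dplyr's summarise_all:
library(dplyr)
# funs() is deprecated since dplyr 0.8.0; a named list of functions yields the same column names
df <- iris %>% group_by(Species) %>% summarise_all(list(mean = mean, sd = sd))
df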
# A tibble: 3 × 9
# Species Sepal.Length_mean Sepal.Width_mean Petal.Length_mean Petal.Width_mean Sepal.Length_sd Sepal.Width_sd
# <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 setosa 5.006 3.428 1.462 0.246 0.3524897 0.3790644
# 2 versicolor 5.936 2.770 4.260 1.326 0.5161711 0.3137983
# 3 virginica 6.588 2.974 5.552 2.026 0.6358796 0.3224966
# ... with 2 more variables: Petal.Length_sd <dbl>, Petal.Width_sd <dbl>
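In dplyr 1.0.0 and later, summarise_all is superseded by across. A minimal sketch of the equivalent call, assuming a recent dplyr is installed, is shown below; the name stats_tbl is made up for this example. The column order differs from the output above, because across groups the statistics by variable (Sepal.Length_mean, Sepal.Length_sd, and so on).

```r
library(dplyr)

# across() applies every function in the named list to every non-grouping column.
stats_tbl <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), list(mean = mean, sd = sd)))

names(stats_tbl)
```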
Another option is data.table:
library(data.table)
as.data.table(iris)[,unlist(lapply(.SD, function(x)
list(Mean = mean(x), SD = sd(x))), recursive = FALSE), Species]
# Species Sepal.Length.Mean Sepal.Length.SD Sepal.Width.Mean Sepal.Width.SD Petal.Length.Mean Petal.Length.SD Petal.Width.Mean
#1: setosa 5.006 0.3524897 3.428 0.3790644 1.462 0.1736640 0.246
#2: versicolor 5.936 0.5161711 2.770 0.3137983 4.260 0.4699110 1.326
#3: virginica 6.588 0.6358796 2.974 0.3224966 5.552 0.5518947 2.026
# Petal.Width.SD
#1: 0.1053856
#2: 0.1977527
#3: 0.2746501