如何获取按 R 中的分类变量值分层的列的统计信息

Question

我想根据Species列中的值得到数据集iris中不同列的mean()和sd():

> head(iris[order(runif(nrow(iris))), ])
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
50           5.0         3.3          1.4         0.2     setosa
111          6.5         3.2          5.1         2.0  virginica
69           6.2         2.2          4.5         1.5 versicolor
150          5.9         3.0          5.1         1.8  virginica

不区分这 3 个不同的物种，apply 就可以了：

> stats = apply(iris[ ,1:4], MARGIN = 2, function(x) rbind(mean(x), SD = sd(x))); row.names(stats) = c("mean", "sd"); stats
     Sepal.Length Sepal.Width Petal.Length Petal.Width
mean    5.8433333   3.0573333     3.758000   1.1993333
sd      0.8280661   0.4358663     1.765298   0.7622377

但是，我怎样才能得到一份列表（？），其中包含按物种分类的这些结果？

Answer 1

您可以使用拆分功能按物种拆分数据以获取数据帧列表

iris2 <- split(iris, iris$Species)
fun <- function(df){
stats = apply(df[ ,1:4], MARGIN = 2, function(x) rbind(mean(x), SD = sd(x)))
row.names(stats) = c("mean", "sd") 
return(stats)
}
lapply(iris2, fun)

Answer 2

aggregate就是你要找的函数：

> aggregate(. ~ Species, data = iris, FUN = mean)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026
> aggregate(. ~ Species, data = iris, FUN = sd)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa    0.3524897   0.3790644    0.1736640   0.1053856
2 versicolor    0.5161711   0.3137983    0.4699110   0.1977527
3  virginica    0.6358796   0.3224966    0.5518947   0.2746501

aggregate 根据一个因素或因素组合计算数据集上的函数。

Answer 3

这不是一个完整的答案（不是 return 列表并且不保持相同的 table 结构）。收录对于 dplyr 的认识非常有用summarize_all

library(dplyr)
df <- iris %>% group_by(Species) %>% summarise_all(funs(mean, sd)) 

# A tibble: 3 × 9
# Species Sepal.Length_mean Sepal.Width_mean Petal.Length_mean Petal.Width_mean Sepal.Length_sd Sepal.Width_sd
# <fctr>             <dbl>            <dbl>             <dbl>            <dbl>           <dbl>          <dbl>
# 1     setosa             5.006            3.428             1.462            0.246       0.3524897      0.3790644
# 2 versicolor             5.936            2.770             4.260            1.326       0.5161711      0.3137983
# 3  virginica             6.588            2.974             5.552            2.026       0.6358796      0.3224966
# ... with 2 more variables: Petal.Length_sd <dbl>, Petal.Width_sd <dbl>

Answer 4

另一种选择是data.table

library(data.table)
as.data.table(iris)[,unlist(lapply(.SD, function(x)
    list(Mean = mean(x), SD = sd(x))), recursive = FALSE), Species]
#     Species Sepal.Length.Mean Sepal.Length.SD Sepal.Width.Mean Sepal.Width.SD Petal.Length.Mean Petal.Length.SD Petal.Width.Mean
#1:     setosa             5.006       0.3524897            3.428      0.3790644             1.462       0.1736640            0.246
#2: versicolor             5.936       0.5161711            2.770      0.3137983             4.260       0.4699110            1.326
#3:  virginica             6.588       0.6358796            2.974      0.3224966             5.552       0.5518947            2.026
#   Petal.Width.SD
#1:      0.1053856
#2:      0.1977527
#3:      0.2746501

如何获取按 R 中的分类变量值分层的列的统计信息

How to get stats of columns stratified by values of a categorical variable in R

r

apply