Group by columns, then compute mean and sd of every other column in R
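How do I group by columns and then compute the mean and standard deviation of every other column in R?

As an example, consider the famous Iris data set. I want to do something like grouping by species and then computing the mean and sd of the petal/sepal length/width measurements. I know this has something to do with split-apply-combine, but I am not sure how to proceed from there.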
This is what I have so far:

library(plyr)

# One mean/sd pair per measurement column, written out by hand
x <- ddply(iris, .(Species), summarise,
           Sepal.Length.Mean = mean(Sepal.Length),
           Sepal.Length.Sd   = sd(Sepal.Length),
           Sepal.Width.Mean  = mean(Sepal.Width),
           Sepal.Width.Sd    = sd(Sepal.Width),
           Petal.Length.Mean = mean(Petal.Length),
           Petal.Length.Sd   = sd(Petal.Length),
           Petal.Width.Mean  = mean(Petal.Width),
           Petal.Width.Sd    = sd(Petal.Width))
x

     Species Sepal.Length.Mean Sepal.Length.Sd Sepal.Width.Mean Sepal.Width.Sd
1     setosa             5.006       0.3524897            3.428      0.3790644
2 versicolor             5.936       0.5161711            2.770      0.3137983
3  virginica             6.588       0.6358796            2.974      0.3224966
  Petal.Length.Mean Petal.Length.Sd Petal.Width.Mean Petal.Width.Sd
1             1.462       0.1736640            0.246      0.1053856
2             4.260       0.4699110            1.326      0.1977527
3             5.552       0.5518947            2.026      0.2746501
The output I want has this form instead, with one column per species and one row per statistic:

z <- data.frame(setosa     = c(5.006, 0.3524897, 3.428, 0.3790644,
                               1.462, 0.1736640, 0.246, 0.1053856),
                versicolor = c(5.936, 0.5161711, 2.770, 0.3137983,
                               4.260, 0.4699110, 1.326, 0.1977527),
                virginica  = c(6.588, 0.6358796, 2.974, 0.3224966,
                               5.552, 0.5518947, 2.026, 0.2746501))
rownames(z) <- c('Sepal.Length.Mean', 'Sepal.Length.Sd',
                 'Sepal.Width.Mean', 'Sepal.Width.Sd',
                 'Petal.Length.Mean', 'Petal.Length.Sd',
                 'Petal.Width.Mean', 'Petal.Width.Sd')
z

                     setosa versicolor virginica
Sepal.Length.Mean 5.0060000  5.9360000 6.5880000
Sepal.Length.Sd   0.3524897  0.5161711 0.6358796
Sepal.Width.Mean  3.4280000  2.7700000 2.9740000
Sepal.Width.Sd    0.3790644  0.3137983 0.3224966
Petal.Length.Mean 1.4620000  4.2600000 5.5520000
Petal.Length.Sd   0.1736640  0.4699110 0.5518947
Petal.Width.Mean  0.2460000  1.3260000 2.0260000
Petal.Width.Sd    0.1053856  0.1977527 0.2746501
library(dplyr)

# Apply mean and sd to every non-grouping column.
# Note: summarise_each() and funs() are deprecated in newer dplyr releases.
res <- iris %>%
  group_by(Species) %>%
  summarise_each(funs(mean, sd))

# Transpose the statistics and use the species as column names
`colnames<-`(t(res[-1]), as.character(res$Species))
#                     setosa versicolor virginica
#Sepal.Length_mean 5.0060000  5.9360000 6.5880000
#Sepal.Width_mean  3.4280000  2.7700000 2.9740000
#Petal.Length_mean 1.4620000  4.2600000 5.5520000
#Petal.Width_mean  0.2460000  1.3260000 2.0260000
#Sepal.Length_sd   0.3524897  0.5161711 0.6358796
#Sepal.Width_sd    0.3790644  0.3137983 0.3224966
#Petal.Length_sd   0.1736640  0.4699110 0.5518947
#Petal.Width_sd    0.1053856  0.1977527 0.2746501
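Because summarise_each() and funs() are deprecated, the same idea can be sketched with across(), which is available in dplyr 1.0.0 and later. This is only an illustrative variant, not part of the original answer; the names res and out are placeholders, and the default across() naming puts each column's mean and sd next to each other, so the row order differs from the output above.

```r
library(dplyr)

# Group by Species, then apply mean and sd to every remaining column.
# across() names the results <column>_<function>, e.g. Sepal.Length_mean.
res <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), list(mean = mean, sd = sd)))

# Transpose so that each species becomes a column and each statistic a row.
out <- t(as.matrix(res[-1]))
colnames(out) <- as.character(res$Species)
out
```

Inside a grouped summarise(), everything() skips the grouping column, so Species does not need to be excluded by hand.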
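Or, as @Steven Beaupre mentioned in the comments, the same reshaping can be done with spread:

library(dplyr)
library(tidyr)

iris %>%
  group_by(Species) %>%
  summarise_each(funs(mean, sd)) %>%
  gather(key, value, -Species) %>%   # long form: one row per species and statistic
  spread(Species, value)             # wide form: one column per species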
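In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A minimal sketch of the same pipeline with those functions follows, assuming tidyr 1.0.0 or later and dplyr 1.0.0 or later; the column name key is kept from the code above and out is a placeholder name.

```r
library(dplyr)
library(tidyr)

out <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), list(mean = mean, sd = sd))) %>%
  # Long form: one row per (Species, statistic) pair.
  pivot_longer(-Species, names_to = "key", values_to = "value") %>%
  # Wide form: one column per species, one row per statistic.
  pivot_wider(names_from = Species, values_from = value)
out
```

The result is a tibble with the statistic names in the key column rather than in the row names.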
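Here is the traditional plyr approach. It uses colwise to apply a function to every column:

library(plyr)

means <- ddply(iris, .(Species), colwise(mean))
sds   <- ddply(iris, .(Species), colwise(sd))
# Join the two tables on Species and tag the columns with their statistic
merge(means, sds, by = "Species", suffixes = c(".mean", ".sd"))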
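For comparison, the same column-wise idea can be sketched in base R without any packages. This uses aggregate() instead of plyr, so it is a different technique from the answer above; the names stats, flat and out are placeholders.

```r
# For each measurement column, compute mean and sd within each Species.
stats <- aggregate(. ~ Species, data = iris,
                   FUN = function(x) c(mean = mean(x), sd = sd(x)))

# aggregate() stores each mean/sd pair as a matrix column;
# do.call(data.frame, ...) flattens them into ordinary columns
# named Sepal.Length.mean, Sepal.Length.sd, and so on.
flat <- do.call(data.frame, stats)

# Transpose so that each species becomes a column.
out <- t(flat[-1])
colnames(out) <- as.character(flat$Species)
out
```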
If you want to use data.table, here is one way (don't be afraid: there are more comments than code ;-). I have tried to optimize all performance-critical points.

library(data.table)

dt <- as.data.table(iris)

# Helper function similar to "colwise" of package "plyr":
# Apply a function "func" to each column of the data.table "data"
# and append the "suffix" string to the result column name.
colwise.dt <- function(data, func, suffix) {
  # apply the function to each column of the data table
  result <- lapply(data, func)
  # convert the result list into a data table efficiently ("by ref")
  setDT(result)
  # append suffix to each column name efficiently ("by ref");
  # "setnames" requires a data.table
  setnames(result, names(result), paste0(names(result), suffix))
  result
}

# Note: .SD is a data.table containing the subset of dt's data for each group (Species),
# excluding any columns used in "by" (here: Species column)
wide.result <- dt[, c(colwise.dt(.SD, mean, ".mean"), colwise.dt(.SD, sd, ".sd")), by = .(Species)]

# Now transpose the result
long.result <- melt(wide.result, id.vars = "Species")

# Now transform into one column per group
final.result <- dcast(long.result, variable ~ Species)
wide.result:

      Species Sepal.Length.mean Sepal.Width.mean Petal.Length.mean Petal.Width.mean Sepal.Length.sd Sepal.Width.sd Petal.Length.sd Petal.Width.sd
1:     setosa             5.006            3.428             1.462            0.246       0.3524897      0.3790644       0.1736640      0.1053856
2: versicolor             5.936            2.770             4.260            1.326       0.5161711      0.3137983       0.4699110      0.1977527
3:  virginica             6.588            2.974             5.552            2.026       0.6358796      0.3224966       0.5518947      0.2746501

long.result:

       Species          variable     value
 1:     setosa Sepal.Length.mean 5.0060000
 2: versicolor Sepal.Length.mean 5.9360000
 3:  virginica Sepal.Length.mean 6.5880000
 4:     setosa  Sepal.Width.mean 3.4280000
 5: versicolor  Sepal.Width.mean 2.7700000
 6:  virginica  Sepal.Width.mean 2.9740000
 7:     setosa Petal.Length.mean 1.4620000
 8: versicolor Petal.Length.mean 4.2600000
 9:  virginica Petal.Length.mean 5.5520000
10:     setosa  Petal.Width.mean 0.2460000
11: versicolor  Petal.Width.mean 1.3260000
12:  virginica  Petal.Width.mean 2.0260000
13:     setosa   Sepal.Length.sd 0.3524897
14: versicolor   Sepal.Length.sd 0.5161711
15:  virginica   Sepal.Length.sd 0.6358796
16:     setosa    Sepal.Width.sd 0.3790644
17: versicolor    Sepal.Width.sd 0.3137983
18:  virginica    Sepal.Width.sd 0.3224966
19:     setosa   Petal.Length.sd 0.1736640
20: versicolor   Petal.Length.sd 0.4699110
21:  virginica   Petal.Length.sd 0.5518947
22:     setosa    Petal.Width.sd 0.1053856
23: versicolor    Petal.Width.sd 0.1977527
24:  virginica    Petal.Width.sd 0.2746501

final.result:

            variable    setosa versicolor virginica
1: Sepal.Length.mean 5.0060000  5.9360000 6.5880000
2:  Sepal.Width.mean 3.4280000  2.7700000 2.9740000
3: Petal.Length.mean 1.4620000  4.2600000 5.5520000
4:  Petal.Width.mean 0.2460000  1.3260000 2.0260000
5:   Sepal.Length.sd 0.3524897  0.5161711 0.6358796
6:    Sepal.Width.sd 0.3790644  0.3137983 0.3224966
7:   Petal.Length.sd 0.1736640  0.4699110 0.5518947
8:    Petal.Width.sd 0.1053856  0.1977527 0.2746501
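The only difference from the desired output is that final.result holds the statistic names in a column named variable rather than in the row names.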
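If row names are needed, the variable column can be moved into the row names of a plain data.frame. The sketch below is self-contained, so it rebuilds a table shaped like final.result with a shorter melt/aggregate/dcast pipeline; that pipeline is a stand-in, not the helper-based code above, and the names long, stats and out are placeholders.

```r
library(data.table)

# Long format: one row per (Species, measurement, value).
long <- melt(as.data.table(iris), id.vars = "Species",
             variable.name = "measure", value.name = "value")

# Mean and sd per Species and measurement, then long again over the statistics.
stats <- long[, .(mean = mean(value), sd = sd(value)), by = .(Species, measure)]
stats <- melt(stats, id.vars = c("Species", "measure"),
              variable.name = "stat", value.name = "value")
stats[, variable := paste(measure, stat, sep = ".")]

# One column per species, shaped like final.result.
final.result <- dcast(stats, variable ~ Species, value.var = "value")

# Move the "variable" column into the row names of a plain data.frame.
out <- as.data.frame(final.result)
rownames(out) <- out$variable
out$variable <- NULL
out
```

dcast() sorts the rows by variable, so the row order is alphabetical rather than the order shown in the desired output.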
Using only dplyr and tidyr:

library(dplyr)
library(tidyr)

# Stack the four measurement columns into var/value pairs
x <- iris %>%
  gather(var, value, -Species)

# Compute the mean and sd for each dimension
x <- x %>%
  group_by(Species, var) %>%
  summarise(mean = mean(value), sd = sd(value)) %>%
  ungroup()

# Convert the data frame from wide form to long form
x <- x %>%
  gather(stat, value, mean:sd)

# Combine the variables "var" and "stat" into a single variable
x <- x %>%
  unite(var, var, stat, sep = '.')

# Convert the data frame from long form to wide form
x <- x %>%
  spread(Species, value)