dplyr: group mean centering (mutate + summarize)

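What is the efficient/preferred way to do group mean centering with dplyr, that is, take each element of a group (mutate) and perform an operation on it using a summary statistic (summarize) of that group? Here is how one could do it with base R:

Group mean centering of mtcars by cyl: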
do.call(rbind, lapply(split(mtcars, mtcars$cyl), function(x){ 
    x[["cent"]] <- x$mpg - mean(x$mpg)
    x
}))
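
For comparison, the same split/apply/combine step can be written more compactly in base R with ave(), which returns the group mean aligned to each row. This is only a sketch of an alternative, not part of the original question; the name centered is made up for the example.

```r
# Work on a copy so the built-in mtcars data set is left untouched
centered <- mtcars

# ave() computes mean(mpg) within each level of cyl (FUN defaults to mean)
# and returns a vector of the same length as mpg, in the original row order
centered$cent <- centered$mpg - ave(centered$mpg, centered$cyl)

head(centered[, c("mpg", "cyl", "cent")])
```

Unlike the do.call(rbind, ...) version, this keeps the original row order and row names.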

You can try:

library(dplyr)
mtcars %>%
      tibble::rownames_to_column() %>% # only if the row names are needed as a column (replaces the deprecated add_rownames())
      group_by(cyl) %>% 
      mutate(cent= mpg-mean(mpg))
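
A quick way to confirm that the centering really happened within groups is to summarize the new column by group: each group mean of cent should be zero up to floating-point error. A minimal sketch of that check, where the names centered, check and mean_cent are made up for the example:

```r
library(dplyr)

# Group-mean center mpg within each cyl group
centered <- mtcars %>%
  group_by(cyl) %>%
  mutate(cent = mpg - mean(mpg)) %>%
  ungroup()

# After within-group centering, the mean of cent in every group is ~0
check <- centered %>%
  group_by(cyl) %>%
  summarise(mean_cent = mean(cent))

print(check)
```

If mean_cent is clearly non-zero for some group, the centering used something other than that group's own mean.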

The code above seems to center mpg on the global mean. What if I want to center on the within-group mean, i.e. a different mean for each level of cyl?

> mtcars %>%
+   add_rownames()%>% #if the rownames are needed as a column
+   group_by(cyl) %>% 
+   mutate(cent= mpg-mean(mpg))%>%
+   dplyr ::select(cent)
Adding missing grouping variables: `cyl`
# A tibble: 32 x 2
# Groups:   cyl [3]
     cyl   cent
   <dbl>  <dbl>
 1     6  0.909
 2     6  0.909
 3     4  2.71 
 4     6  1.31 
 5     8 -1.39 
 6     6 -1.99 
 7     8 -5.79 
 8     4  4.31 
 9     4  2.71 
10     6 -0.891
# … with 22 more rows
Warning message:
Deprecated, use tibble::rownames_to_column() instead. 
> mtcars$mpg[1:5]-mean(mtcars$mpg)
[1]  0.909375  0.909375  2.709375  1.309375 -1.390625
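
The cent values in this output do equal mtcars$mpg - mean(mtcars$mpg), so the grouping was ignored in that session. One possible cause, which is only a guess because the attached packages are not shown, is that plyr was loaded after dplyr, so that mutate resolved to plyr::mutate, which does not respect group_by(). A sketch that sidesteps the issue by calling dplyr::mutate explicitly; global_cent is a hypothetical column added only for comparison:

```r
library(dplyr)

res <- mtcars %>%
  group_by(cyl) %>%
  # Namespace-qualified call: always uses dplyr's grouped mutate,
  # whatever other packages are attached
  dplyr::mutate(
    cent        = mpg - mean(mpg),         # within-group centering
    global_cent = mpg - mean(mtcars$mpg)   # global centering, for comparison
  ) %>%
  ungroup()

head(res[, c("cyl", "mpg", "cent", "global_cent")])
```

Running find("mutate") in the affected session shows which package's mutate comes first on the search path.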

You can also try this (although the name of the new variable is displayed differently):

mtcars %>%
  group_by(cyl) %>%
  mutate(gpcent = scale(mpg, scale = FALSE))
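
The name is displayed differently because scale() returns a one-column matrix rather than a plain vector, so the result is stored as a matrix column. If an ordinary numeric column is preferred, one option is to wrap the call in as.numeric(), which drops the matrix dimensions and the attributes scale() attaches. A minimal sketch, with res as a made-up name:

```r
library(dplyr)

res <- mtcars %>%
  group_by(cyl) %>%
  # scale(center = TRUE, scale = FALSE) subtracts the group mean only;
  # as.numeric() converts the resulting one-column matrix to a plain vector
  mutate(gpcent = as.numeric(scale(mpg, scale = FALSE))) %>%
  ungroup()

str(res$gpcent)
```

The values are the same as mpg - mean(mpg) computed within each cyl group.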