Distilling summary statistics by numerical categories with dplyr

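I have a large data frame (more than 200,000 rows) that contains dozens of columns of data. I would like to distill this data frame down and summarise how many observations fall within given ranges of a variable.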

For example, if I have a data.frame similar to this

 set.seed(10)
 df <- data.frame( age = runif( n = 1000, min = 0, max = 4000 ),
                   size = rnorm( n = 1000, mean = 10, sd = 1 ),
                   shape = rnorm( n = 1000, mean = 1000, sd = 1000) )

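I would like to work out, over a series of age ranges, the number of samples, the mean size and shape, and the median size and shape of the samples in each age bin.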

Something like

 summary.df <- data.frame( age.group = seq( 0, 3900, by = 100 ),
                           number = (number of samples in age bin),
                           mean = ( mean of data in age bin ) )

and so on.

Right now I am doing this in a very brute-force way, by creating a new data.frame for each age bin.

data.1          <- subset( df, age > 0 & age <= 100 )
data.2          <- subset( df, age > 100 & age <= 200 )
data.3          <- subset( df, age > 200 & age <= 300 )

and so on, then adding a categorical variable

data.1 <- data.frame( data.1, age.group = "100", count.row = nrow( data.1 ) )
data.2 <- data.frame( data.2, age.group = "200", count.row = nrow( data.2 ) )
data.3 <- data.frame( data.3, age.group = "300", count.row = nrow( data.3 ) )

binding them together

data.big <- rbind( data.1, data.2, data.3 )

and then generating summary statistics with dplyr

data.summary <- data.big %>%
   group_by( age.group ) %>%
   summarize( count.row = mean( count.row ),
         mean = mean( size, na.rm = TRUE ),
         median = median( size, na.rm = TRUE ) )

How can I do this more efficiently using only dplyr? I think there must be a way, but I can't get my head around it.

Thanks for your help!

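You can use cut to split the data into intervals of 100 and then calculate the summary statistics for each group. With breaks at 0, 100, 200, ..., cut produces right-closed bins such as (0,100], which match the age > 0 & age <= 100 subsets in the question.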

library(dplyr)

df %>%
  group_by(age = cut(age, seq( 0, 4000, by = 100))) %>%
  summarise(mean = mean( size, na.rm = TRUE),
            median = median( size, na.rm = TRUE))

#   age          mean median
#   <fct>       <dbl>  <dbl>
# 1 (0,100]     10.0    9.92
# 2 (100,200]    9.88  10.2 
# 3 (200,300]   10.1   10.1 
# 4 (300,400]    9.83   9.80
# 5 (400,500]    9.95   9.72
# 6 (500,600]    9.68   9.78
# 7 (600,700]   10.2   10.5 
# 8 (700,800]   10.2   10.4 
# 9 (800,900]    9.68   9.47
#10 (900,1e+03]  9.80   9.81
# … with 30 more rows
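
The question also asks for the number of samples per bin and for the shape statistics, which the snippet above does not compute. A minimal sketch of how the same pipeline might be extended, assuming the example df from the question; the names breaks, summary.df, and the output column names are only illustrative choices, not part of the original answer:

```r
library(dplyr)

# Example data from the question
set.seed(10)
df <- data.frame(age   = runif(n = 1000, min = 0, max = 4000),
                 size  = rnorm(n = 1000, mean = 10, sd = 1),
                 shape = rnorm(n = 1000, mean = 1000, sd = 1000))

# Bin edges every 100 units (illustrative name)
breaks <- seq(0, 4000, by = 100)

summary.df <- df %>%
  # cut() assigns each age to a right-closed bin: (0,100], (100,200], ...
  group_by(age.group = cut(age, breaks = breaks)) %>%
  summarise(number       = n(),                           # rows in the bin
            mean.size    = mean(size, na.rm = TRUE),
            median.size  = median(size, na.rm = TRUE),
            mean.shape   = mean(shape, na.rm = TRUE),
            median.shape = median(shape, na.rm = TRUE))
```

Here n() counts the rows in each group, so no separate count.row column or rbind step is needed. If you prefer bins labelled by their upper bound, as in the question ("100", "200", ...), passing labels = breaks[-1] to cut should give that. Any age outside the breaks (for example exactly 0) would end up in an NA group.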