计算测量的汇总统计数据并将它们旋转到 R 中的列

Calculate summary statistics of measurements and pivot them to columns in R

我有一个这样的数据框

Step <- c("1","1","4","3","2","2","3","4","4","3","1","3","2","4","3","1","2")
Length <- c(0.1,0.5,0.7,0.8,0.2,0.1,0.3,0.8,0.9,0.15,0.25,0.27,0.28,0.61,0.15,0.37,0.18)
Breadth <- c(0.13,0.35,0.87,0.38,0.52,0.71,0.43,0.8,0.9,0.15,0.45,0.7,0.8,0.11,0.11,0.47,0.28)
Height <- c(0.31,0.35,0.37,0.38,0.32,0.51,0.53,0.48,0.9,0.15,0.35,0.32,0.22,0.11,0.17,0.27,0.38)
Width <- c(0.21,0.25,0.27,0.8,0.2,0.21,0.3,0.28,0.29,0.65,0.55,0.37,0.26,0.31,0.5,0.7,0.8)

df <- data.frame(Step,Length,Breadth,Height,Width) 

我正在尝试计算按步骤分组的测量值的 最大值、最小值、平均值、中值、标准差,然后将那些具有测量值的列作为列。

我的期望输出

  Measurement max_1 min_1 mean_1 median_1       sd_1 max_2 min_2 mean_2 median_2       sd_2 max_3 min_3 mean_3 median_3      sd_3 max_4 min_4 mean_4 median_4       sd_4
       Length  0.50  0.10 0.3050     0.31 0.17058722  0.28  0.10 0.1900    0.190 0.07393691  0.80  0.15  0.334     0.27 0.2693139  0.90  0.61 0.7525    0.750 0.12526638
      Breadth  0.47  0.13 0.3500     0.40 0.15577760  0.80  0.28 0.5775    0.615 0.23012680  0.70  0.11  0.354     0.38 0.2383904  0.90  0.11 0.6700    0.835 0.37567720
       Height  0.35  0.27 0.3200     0.33 0.03829708  0.51  0.22 0.3575    0.350 0.12120919  0.53  0.15  0.310     0.32 0.1570032  0.90  0.11 0.4650    0.425 0.32888701
        Width  0.70  0.21 0.4275     0.40 0.23669601  0.80  0.20 0.3675    0.235 0.28952547  0.80  0.30  0.524     0.50 0.2040343  0.31  0.27 0.2875    0.285 0.01707825

我正在尝试通过这种方式来计算汇总统计数据,但这不是一种有效的方法。

library(dplyr)
df1 <- df %>%
  group_by(Step) %>%
  summarise(Length_Mean = mean(Length),
            Breadth_Mean = mean(Breadth),
            Height_Mean = mean(Height),
            Width_Mean = mean(Width))

如何使用最少的代码高效地完成我想要的输出?有人能给我指出正确的方向吗?

您可以使用 summarize"scoped" 版本来计算相同的摘要 一次统计多列。来自 ?scoped:

The variants suffixed with _if, _at or _all apply an expression (sometimes several) to all variables within a specified subset. This subset can contain all variables (_all variants), a vars() selection (_at variants), or variables selected with a predicate (_if variants).

这里summarize_all可能是个不错的选择;它选择除 对于分组 columns.You 也可以提供几个汇总函数 计算选择中的每个变量。

library(tidyverse)

# Calculate the summary statistics
sums <- df %>% 
  group_by(Step) %>% 
  summarize_all(funs(max, min, mean, median, sd))

sums
#> # A tibble: 4 x 21
#>   Step  Length_max Breadth_max Height_max Width_max Length_min Breadth_min
#>   <fct>      <dbl>       <dbl>      <dbl>     <dbl>      <dbl>       <dbl>
#> 1 1           0.5         0.47       0.35      0.7        0.1         0.13
#> 2 2           0.28        0.8        0.51      0.8        0.1         0.28
#> 3 3           0.8         0.7        0.53      0.8        0.15        0.11
#> 4 4           0.9         0.9        0.9       0.31       0.61        0.11
#> # ... with 14 more variables: Height_min <dbl>, Width_min <dbl>,
#> #   Length_mean <dbl>, Breadth_mean <dbl>, Height_mean <dbl>,
#> #   Width_mean <dbl>, Length_median <dbl>, Breadth_median <dbl>,
#> #   Height_median <dbl>, Width_median <dbl>, Length_sd <dbl>,
#> #   Breadth_sd <dbl>, Height_sd <dbl>, Width_sd <dbl>

现在我们有了汇总统计数据,剩下要做的就是 重塑数据以获得所需的输出。为此,gatherspread、 来自 tidyrseparateunite 派上用场了:

sums %>% 
  # Reshape to long format
  gather(col, val, -Step) %>% 
  # Separate the measurement and the summary statistic
  separate(col, into = c("Measurement", "stat")) %>% 
  arrange(Step) %>% 
  # Create the desired column headings
  unite(col, stat, Step) %>% 
  # Need to use factors to preserve order
  mutate_at(vars(col, Measurement), fct_inorder) %>% 
  # Reshape back to wide format
  spread(col, val)
#> # A tibble: 4 x 21
#>   Measurement max_1 min_1 mean_1 median_1   sd_1 max_2 min_2 mean_2
#>   <fct>       <dbl> <dbl>  <dbl>    <dbl>  <dbl> <dbl> <dbl>  <dbl>
#> 1 Length       0.5   0.1   0.305    0.31  0.171   0.28  0.1   0.19 
#> 2 Breadth      0.47  0.13  0.35     0.4   0.156   0.8   0.28  0.578
#> 3 Height       0.35  0.27  0.32     0.330 0.0383  0.51  0.22  0.358
#> 4 Width        0.7   0.21  0.428    0.4   0.237   0.8   0.2   0.368
#> # ... with 12 more variables: median_2 <dbl>, sd_2 <dbl>, max_3 <dbl>,
#> #   min_3 <dbl>, mean_3 <dbl>, median_3 <dbl>, sd_3 <dbl>, max_4 <dbl>,
#> #   min_4 <dbl>, mean_4 <dbl>, median_4 <dbl>, sd_4 <dbl>

reprex package (v0.2.0) 创建于 2018-05-24。