How to aggregate and average various rows based on multiple groups while keeping other columns intact

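I have spent hours trying to figure out how to do this, but I'm not sure what the best approach is. I have species and environmental data with 2 replicates per site (for each specific day and month), and I want to combine the data collected in the 'S' and 'E' tows at each specific site, month, and day. I'm running some analyses and want to merge the 'S' and 'E' tows so that the two tows at each site end up as a single row of data (per day and per month). I'm not sure how to explain this well in words, so I'll try to show an example to explain myself better.

Here is a simplified version of my data: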

structure(list(month = c("11", "11", "11", "11", "11", "11", 
"7", "7", "7", "7", "7", "7", "8", "8", "8", "8"), day = c("4", 
"4", "4", "4", "5", "5", "20", "20", "27", "27", "27", "27", 
"16", "16", "16", "16"), Site = c(11L, 11L, 6L, 6L, 9L, 9L, 10L, 
10L, 13L, 13L, 2L, 3L, 4L, 5L, 5L, 6L), Tow = c("E", "S", "E", 
"S", "E", "S", "E", "S", "E", "S", "S", "S", "S", "E", "S", "E"
), Depth = c(10L, 11L, 22L, 22L, 12L, 13L, 13L, 13L, 19L, 19L, 
14L, 21L, 22L, 22L, 22L, 22L), Temp = c(12.75, 12.9, 14.25, 14.239, 
12.975, 12.955, 23.804, 23.804, 23.89, 23.9, 24.41, 24.04, 23.915, 
23.988, 24.021, 23.957), DO_mgL = c(10.54, 10.45, 10.16, 10.12, 
10.4, 10.39, 7.24, 7.11, 8.07, 8.1, 9.14, 1.29, 2.44, 2.45, 2.48, 
2.54), secchi = c(1.25, 1.25, 2.25, 2.25, 1.5, 1.5, 2.7, 2.7, 
2.1, 2.1, 2.75, 1.25, 2.8, 3, 3, 3.25), d.lept = c(0, 0, 0, 0, 
0, 0, 0.008037479, 0.155240934, 0.128494423, 0.025249815, 0.053921767, 
0.012391113, 0.069338871, 0.022259485, 0.013767903, 0.046661095
), d.byths = c(0, 0, 0, 0, 0, 0, 0, 0.007392425, 0, 0, 0, 0, 
0, 0.044518969, 0.013767903, 0.015553698), d.daph = c(0.140036552, 
1.010093452, 1.629907953, 2.762608821, 1.130338642, 1.311853781, 
0.031419235, 0.029569702, 0.0525659, 0.084166051, 0.024509894, 
0.049564452, 0.104008307, 0.133556908, 0.082607421, 0.062214794
)), row.names = c(1L, 2L, 3L, 4L, 21L, 22L, 23L, 24L, 33L, 34L, 
35L, 36L, 58L, 59L, 60L, 61L), class = "data.frame")

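For example, my first two rows (month == 11, day == 4, Site == 11) should end up as 1 row in which 'Depth', 'Temp', and 'DO_mgL' are averaged, 'secchi' stays as it is (the reading is always the same for the 'S' and 'E' tows), and the species densities are summed. The Tow column can be dropped once this is done. I'd like to end up with something like this (just an example of what the first two rows should finally look like):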

month day Site Depth Temp DO_mgL secchi d.lept d.byths d.daph
11 4 11 10.5 12.83 10.50 1.25 0 0 1.15013

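Honestly, I'm not even sure where to start. The following does, to some extent, what I want for my species, but only for one species at a time (I have 8 species in total, shortened for this example), and it drops the other columns: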

aggregate(d.lept ~ month + day + Site, data=zp1, FUN = sum)

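Again, I need to treat the 'S' and 'E' tows as one collective unit.

To complicate things further, time and weather constraints in the field sometimes prevented us from collecting replicates, so some sites only have data for the 'S' tow. Those should stay as they are, because there is only one row for those particular sites/days/months.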

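My full dataset has 97 rows and 16 columns. I sampled 24 sites in total across July, August, and September. I have 8 species and their associated densities (from counts).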

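I looked at several existing posts about the summing part of my question, but they didn't help me much.

I hope this is clear and makes sense, but I'm happy to provide further clarification. Thank you for your time.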

What you are asking for can be done with the standard group_by + summarise combination from the tidyverse.

library(tidyverse)
library(janitor)
#> 
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#> 
#>     chisq.test, fisher.test

df <- structure(list(
  month = c("11", "11", "11", "11", "11", "11", "7", "7", "7", "7", "7", "7",
            "8", "8", "8", "8"),
  day = c("4", "4", "4", "4", "5", "5", "20", "20", "27", "27", "27", "27",
          "16", "16", "16", "16"),
  Site = c(11L, 11L, 6L, 6L, 9L, 9L, 10L, 10L, 13L, 13L, 2L, 3L, 4L, 5L, 5L,
           6L),
  Tow = c("E", "S", "E", "S", "E", "S", "E", "S", "E", "S", "S", "S", "S",
          "E", "S", "E"),
  Depth = c(10L, 11L, 22L, 22L, 12L, 13L, 13L, 13L, 19L, 19L, 14L, 21L, 22L,
            22L, 22L, 22L),
  Temp = c(12.75, 12.9, 14.25, 14.239, 12.975, 12.955, 23.804, 23.804, 23.89,
           23.9, 24.41, 24.04, 23.915, 23.988, 24.021, 23.957),
  DO_mgL = c(10.54, 10.45, 10.16, 10.12, 10.4, 10.39, 7.24, 7.11, 8.07, 8.1,
             9.14, 1.29, 2.44, 2.45, 2.48, 2.54),
  secchi = c(1.25, 1.25, 2.25, 2.25, 1.5, 1.5, 2.7, 2.7, 2.1, 2.1, 2.75,
             1.25, 2.8, 3, 3, 3.25),
  d.lept = c(0, 0, 0, 0, 0, 0, 0.008037479, 0.155240934, 0.128494423,
             0.025249815, 0.053921767, 0.012391113, 0.069338871, 0.022259485,
             0.013767903, 0.046661095),
  d.byths = c(0, 0, 0, 0, 0, 0, 0, 0.007392425, 0, 0, 0, 0, 0, 0.044518969,
              0.013767903, 0.015553698),
  d.daph = c(0.140036552, 1.010093452, 1.629907953, 2.762608821, 1.130338642,
             1.311853781, 0.031419235, 0.029569702, 0.0525659, 0.084166051,
             0.024509894, 0.049564452, 0.104008307, 0.133556908, 0.082607421,
             0.062214794)
), row.names = c(1L, 2L, 3L, 4L, 21L, 22L, 23L, 24L, 33L, 34L, 35L, 36L, 58L,
                 59L, 60L, 61L), class = "data.frame") %>%
  clean_names()

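df %>% 
  # one group per sampling event; the S and E tows fall into the same group
  group_by(month, day, site) %>% 
  summarise(avg_temp = mean(temp),
            avg_do_mg_l = mean(do_mg_l),
            # secchi is identical for both tows, so keep the first value
            # (this returns one row per group, so no distinct() is needed)
            secchi = first(secchi),
            sum_d_lept = sum(d_lept),
            sum_d_byths = sum(d_byths),
            sum_d_daph = sum(d_daph),
            .groups = "drop")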
#> # A tibble: 10 x 9
#>    month day    site avg_temp avg_do_mg_l secchi sum_d_lept sum_d_byths
#>    <chr> <chr> <int>    <dbl>       <dbl>  <dbl>      <dbl>       <dbl>
#>  1 11    4         6     14.2       10.1    2.25     0          0      
#>  2 11    4        11     12.8       10.5    1.25     0          0      
#>  3 11    5         9     13.0       10.4    1.5      0          0      
#>  4 7     20       10     23.8        7.18   2.7      0.163      0.00739
#>  5 7     27        2     24.4        9.14   2.75     0.0539     0      
#>  6 7     27        3     24.0        1.29   1.25     0.0124     0      
#>  7 7     27       13     23.9        8.09   2.1      0.154      0      
#>  8 8     16        4     23.9        2.44   2.8      0.0693     0      
#>  9 8     16        5     24.0        2.46   3        0.0360     0.0583 
#> 10 8     16        6     24.0        2.54   3.25     0.0467     0.0156 
#> # ... with 1 more variable: sum_d_daph <dbl>

Created on 2022-04-29 by the reprex package (v2.0.1)
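With 8 species columns, writing one sum per species by hand gets tedious. Below is a minimal sketch of the same group_by + summarise idea using dplyr's across(), assuming the cleaned column names from clean_names() and assuming every species column starts with "d_" (true for the three species shown here, but an assumption for the full dataset). The three-row data frame is a small subset of the question's data, and n_tows is a hypothetical helper column added only to show which groups had a single tow.

```r
library(dplyr)

# Three rows taken from the question's data: one site with both tows,
# one site with only an 'S' tow.
df_small <- data.frame(
  month   = c("11", "11", "7"),
  day     = c("4", "4", "27"),
  site    = c(11L, 11L, 2L),
  tow     = c("E", "S", "S"),
  depth   = c(10, 11, 14),
  temp    = c(12.75, 12.9, 24.41),
  do_mg_l = c(10.54, 10.45, 9.14),
  secchi  = c(1.25, 1.25, 2.75),
  d_lept  = c(0, 0, 0.053921767),
  d_daph  = c(0.140036552, 1.010093452, 0.024509894)
)

out <- df_small %>%
  group_by(month, day, site) %>%
  summarise(
    n_tows = n(),                           # 1 where no replicate was collected
    across(c(depth, temp, do_mg_l), mean),  # average the environmental readings
    secchi = first(secchi),                 # same for both tows, keep one value
    across(starts_with("d_"), sum),         # sum every species density column
    .groups = "drop"
  )

print(out)
```

Groups with a single tow pass through unchanged, because the mean or sum of one value is that value. The tow column is dropped automatically since it is neither a grouping column nor summarised.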

It sounds like you want to collapse the rows in each group into a single row (??).

data.table:

library(data.table)
##
#
setDT(df)[, .(
  Temp    = mean(Temp),
  DO_mgL  = mean(DO_mgL),
  secchi  = mean(secchi),
  d.lept  = sum(d.lept),
  d.byths = sum(d.byths),
  d.daph  = sum(d.daph)
), by=.(month, day, Site)]

##     month day Site    Temp DO_mgL secchi     d.lept     d.byths     d.daph
##  1:    11   4   11 12.8250 10.495   1.25 0.00000000 0.000000000 1.15013000
##  2:    11   4    6 14.2445 10.140   2.25 0.00000000 0.000000000 4.39251677
##  3:    11   5    9 12.9650 10.395   1.50 0.00000000 0.000000000 2.44219242
##  4:     7  20   10 23.8040  7.175   2.70 0.16327841 0.007392425 0.06098894
##  5:     7  27   13 23.8950  8.085   2.10 0.15374424 0.000000000 0.13673195
##  6:     7  27    2 24.4100  9.140   2.75 0.05392177 0.000000000 0.02450989
##  7:     7  27    3 24.0400  1.290   1.25 0.01239111 0.000000000 0.04956445
##  8:     8  16    4 23.9150  2.440   2.80 0.06933887 0.000000000 0.10400831
##  9:     8  16    5 24.0045  2.465   3.00 0.03602739 0.058286872 0.21616433
## 10:     8  16    6 23.9570  2.540   3.25 0.04666109 0.015553698 0.06221479

setDT(df) converts your df to a data.table (by reference, so no copy is made). The by=.(...) clause defines the groups, and the .(...) clause does the aggregation.
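If you would rather not type out every species column, one possible variation is to aggregate by column sets with .SDcols and then merge the pieces. This is only a sketch under some assumptions: the species columns are picked up with the pattern "^d\\." (which fits the columns shown, but is a guess for the full dataset), Depth is averaged as the question requests, and n_tows is a hypothetical extra column counting the tows in each group. The small table is a three-row subset of the question's data.

```r
library(data.table)

# Three rows taken from the question's data: one site with both tows,
# one site with only an 'S' tow.
dt <- data.table(
  month  = c("11", "11", "7"),
  day    = c("4", "4", "27"),
  Site   = c(11L, 11L, 2L),
  Tow    = c("E", "S", "S"),
  Depth  = c(10, 11, 14),
  Temp   = c(12.75, 12.9, 24.41),
  DO_mgL = c(10.54, 10.45, 9.14),
  secchi = c(1.25, 1.25, 2.75),
  d.lept = c(0, 0, 0.053921767),
  d.daph = c(0.140036552, 1.010093452, 0.024509894)
)

keys      <- c("month", "day", "Site")
mean_cols <- c("Depth", "Temp", "DO_mgL", "secchi")   # averaged per group
sum_cols  <- grep("^d\\.", names(dt), value = TRUE)   # species densities

counts <- dt[, .(n_tows = .N), by = keys]                        # tows per group
means  <- dt[, lapply(.SD, mean), by = keys, .SDcols = mean_cols]
sums   <- dt[, lapply(.SD, sum),  by = keys, .SDcols = sum_cols]

# Join the three summaries back together on the grouping columns.
res <- merge(merge(counts, means, by = keys), sums, by = keys)

print(res)
```

Averaging secchi is harmless here because both tows share the same reading, which is also what the answer above does.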