How to aggregate and average various rows based on multiple groups while keeping other columns intact
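I have spent hours trying to figure out how to do this, but I'm not sure what the best approach is. I have species and environmental data with 2 replicates per site (for each specific day and month), and I want to combine the data collected in the 'S' and 'E' tows at each specific site, month, and day. I am doing some analyses and want to merge the 'S' and 'E' tows so that there is only one row of data per site for the two tows (by day and by month). I'm not sure how to explain this well in words, so I'll try to show an example.
Here is a simplified version of my data: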
structure(list(month = c("11", "11", "11", "11", "11", "11",
"7", "7", "7", "7", "7", "7", "8", "8", "8", "8"), day = c("4",
"4", "4", "4", "5", "5", "20", "20", "27", "27", "27", "27",
"16", "16", "16", "16"), Site = c(11L, 11L, 6L, 6L, 9L, 9L, 10L,
10L, 13L, 13L, 2L, 3L, 4L, 5L, 5L, 6L), Tow = c("E", "S", "E",
"S", "E", "S", "E", "S", "E", "S", "S", "S", "S", "E", "S", "E"
), Depth = c(10L, 11L, 22L, 22L, 12L, 13L, 13L, 13L, 19L, 19L,
14L, 21L, 22L, 22L, 22L, 22L), Temp = c(12.75, 12.9, 14.25, 14.239,
12.975, 12.955, 23.804, 23.804, 23.89, 23.9, 24.41, 24.04, 23.915,
23.988, 24.021, 23.957), DO_mgL = c(10.54, 10.45, 10.16, 10.12,
10.4, 10.39, 7.24, 7.11, 8.07, 8.1, 9.14, 1.29, 2.44, 2.45, 2.48,
2.54), secchi = c(1.25, 1.25, 2.25, 2.25, 1.5, 1.5, 2.7, 2.7,
2.1, 2.1, 2.75, 1.25, 2.8, 3, 3, 3.25), d.lept = c(0, 0, 0, 0,
0, 0, 0.008037479, 0.155240934, 0.128494423, 0.025249815, 0.053921767,
0.012391113, 0.069338871, 0.022259485, 0.013767903, 0.046661095
), d.byths = c(0, 0, 0, 0, 0, 0, 0, 0.007392425, 0, 0, 0, 0,
0, 0.044518969, 0.013767903, 0.015553698), d.daph = c(0.140036552,
1.010093452, 1.629907953, 2.762608821, 1.130338642, 1.311853781,
0.031419235, 0.029569702, 0.0525659, 0.084166051, 0.024509894,
0.049564452, 0.104008307, 0.133556908, 0.082607421, 0.062214794
)), row.names = c(1L, 2L, 3L, 4L, 21L, 22L, 23L, 24L, 33L, 34L,
35L, 36L, 58L, 59L, 60L, 61L), class = "data.frame")
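For example, my first two rows (month == 11, day == 4, Site == 11) should end up as 1 row in which 'Temp' and 'DO_mgL' are averaged, the 'secchi' reading stays the same (it is always identical for the 'S' and 'E' tows), and the species densities are added together (summed). The Tow column can be removed once this is done. I would like to end up with something like this (just an example of what the first two rows should look like once combined):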
month | day | Site | Depth | Temp | DO_mgL | secchi | d.lept | d.byths | d.daph |
---|---|---|---|---|---|---|---|---|---|
11 | 4 | 11 | 10.5 | 12.83 | 10.50 | 1.25 | 0 | 0 | 1.15013 |
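Honestly, I'm not even sure where to start. The following does more or less what I want for my species, but only for one species at a time (I have 8 species in total, shortened for this example), and it drops the other columns: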
aggregate(d.lept ~ month + day + Site, data=zp1, FUN = sum)
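Again, I need to treat the 'S' and 'E' tows as one set:
- Average 'Temp' and 'DO_mgL' across the 'S' and 'E' tows for each site, day, and month
- Keep 'secchi' unchanged, since the value is the same for every 'S' and 'E' pair
- Add up the species densities across the 'S' and 'E' tows for each site, day, and month
To complicate matters, time and weather constraints in the field sometimes prevented us from collecting replicates, so some sites only have data for the 'S' tow. These should be left as they are, because there is only one row for those particular sites/days/months.
My whole data set has 97 rows and 16 columns. I sampled a total of 24 sites across July, August, and September. I have 8 species and their associated densities (derived from counts).
I looked at several posts about the summing part of my question, but they did not help me much.
I hope this is clear and makes sense, but I am happy to provide further clarification. Thank you for your time.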
What you are asking for can be done with the standard group_by + summarise combination from the tidyverse. Further reading can be found here.
library(tidyverse)
library(janitor)
#>
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#>
#> chisq.test, fisher.test
df <- structure(list(month = c("11", "11", "11", "11", "11", "11",
"7", "7", "7", "7", "7", "7", "8", "8", "8", "8"), day = c("4",
"4", "4", "4", "5", "5", "20", "20", "27", "27", "27", "27",
"16", "16", "16", "16"), Site = c(11L, 11L, 6L, 6L, 9L, 9L, 10L,
10L, 13L, 13L, 2L, 3L, 4L, 5L, 5L, 6L), Tow = c("E", "S", "E",
"S", "E", "S", "E", "S", "E", "S", "S", "S", "S", "E", "S", "E"
), Depth = c(10L, 11L, 22L, 22L, 12L, 13L, 13L, 13L, 19L, 19L,
14L, 21L, 22L, 22L, 22L, 22L), Temp = c(12.75, 12.9, 14.25, 14.239,
12.975, 12.955, 23.804, 23.804, 23.89, 23.9, 24.41, 24.04, 23.915,
23.988, 24.021, 23.957), DO_mgL = c(10.54, 10.45, 10.16, 10.12,
10.4, 10.39, 7.24, 7.11, 8.07, 8.1, 9.14, 1.29, 2.44, 2.45, 2.48,
2.54), secchi = c(1.25, 1.25, 2.25, 2.25, 1.5, 1.5, 2.7, 2.7,
2.1, 2.1, 2.75, 1.25, 2.8, 3, 3, 3.25), d.lept = c(0, 0, 0, 0,
0, 0, 0.008037479, 0.155240934, 0.128494423, 0.025249815, 0.053921767,
0.012391113, 0.069338871, 0.022259485, 0.013767903, 0.046661095
), d.byths = c(0, 0, 0, 0, 0, 0, 0, 0.007392425, 0, 0, 0, 0,
0, 0.044518969, 0.013767903, 0.015553698), d.daph = c(0.140036552,
1.010093452, 1.629907953, 2.762608821, 1.130338642, 1.311853781,
0.031419235, 0.029569702, 0.0525659, 0.084166051, 0.024509894,
0.049564452, 0.104008307, 0.133556908, 0.082607421, 0.062214794
)), row.names = c(1L, 2L, 3L, 4L, 21L, 22L, 23L, 24L, 33L, 34L,
35L, 36L, 58L, 59L, 60L, 61L), class = "data.frame") %>%
clean_names()
df %>%
  group_by(month, day, site) %>%
  summarise(avg_temp = mean(temp),
            avg_do_mg_l = mean(do_mg_l),
            secchi = first(secchi),  # identical within each group, so keep one value
            sum_d_lept = sum(d_lept),
            sum_d_byths = sum(d_byths),
            sum_d_daph = sum(d_daph),
            .groups = "drop")
#> # A tibble: 10 x 9
#> month day site avg_temp avg_do_mg_l secchi sum_d_lept sum_d_byths
#> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 11 4 6 14.2 10.1 2.25 0 0
#> 2 11 4 11 12.8 10.5 1.25 0 0
#> 3 11 5 9 13.0 10.4 1.5 0 0
#> 4 7 20 10 23.8 7.18 2.7 0.163 0.00739
#> 5 7 27 2 24.4 9.14 2.75 0.0539 0
#> 6 7 27 3 24.0 1.29 1.25 0.0124 0
#> 7 7 27 13 23.9 8.09 2.1 0.154 0
#> 8 8 16 4 23.9 2.44 2.8 0.0693 0
#> 9 8 16 5 24.0 2.46 3 0.0360 0.0583
#> 10 8 16 6 24.0 2.54 3.25 0.0467 0.0156
#> # ... with 1 more variable: sum_d_daph <dbl>
Created on 2022-04-29 by the reprex package (v2.0.1)
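With 8 species columns in the full data set, naming each one inside summarise gets repetitive. Below is a minimal sketch of the same group_by + summarise idea using across(). It rests on a few assumptions that are not confirmed by the question: the original column names are kept (no clean_names()), every species density column starts with "d.", and Depth should be averaged as in the expected row shown in the question. The small zp data frame is only a stand-in built from three rows of the posted data.

```r
library(dplyr)

# Stand-in data: one replicated site (E + S tows) and one site with only an
# 'S' tow, copied from the question's sample.
zp <- data.frame(
  month   = c("11", "11", "7"),
  day     = c("4", "4", "27"),
  Site    = c(11L, 11L, 2L),
  Tow     = c("E", "S", "S"),
  Depth   = c(10L, 11L, 14L),
  Temp    = c(12.75, 12.9, 24.41),
  DO_mgL  = c(10.54, 10.45, 9.14),
  secchi  = c(1.25, 1.25, 2.75),
  d.lept  = c(0, 0, 0.053921767),
  d.byths = c(0, 0, 0),
  d.daph  = c(0.140036552, 1.010093452, 0.024509894)
)

combined <- zp %>%
  group_by(month, day, Site) %>%
  summarise(
    across(c(Depth, Temp, DO_mgL), mean),  # average the environmental readings
    secchi = first(secchi),                # same value in both tows, keep one
    across(starts_with("d."), sum),        # assumption: species columns start with "d."
    .groups = "drop"
  )

print(combined)
```

Because Tow is neither a grouping column nor summarised, it is dropped from the result, and a site with a single tow passes through with its values unchanged (the mean or sum of one value is that value).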
It sounds like you want to collapse the rows in each group into a single row (??).
With data.table:
library(data.table)
setDT(df)[, .(
Temp = mean(Temp),
DO_mgL = mean(DO_mgL),
secchi = mean(secchi),
d.lept = sum(d.lept),
d.byths = sum(d.byths),
d.daph = sum(d.daph)
), by=.(month, day, Site)]
## month day Site Temp DO_mgL secchi d.lept d.byths d.daph
## 1: 11 4 11 12.8250 10.495 1.25 0.00000000 0.000000000 1.15013000
## 2: 11 4 6 14.2445 10.140 2.25 0.00000000 0.000000000 4.39251677
## 3: 11 5 9 12.9650 10.395 1.50 0.00000000 0.000000000 2.44219242
## 4: 7 20 10 23.8040 7.175 2.70 0.16327841 0.007392425 0.06098894
## 5: 7 27 13 23.8950 8.085 2.10 0.15374424 0.000000000 0.13673195
## 6: 7 27 2 24.4100 9.140 2.75 0.05392177 0.000000000 0.02450989
## 7: 7 27 3 24.0400 1.290 1.25 0.01239111 0.000000000 0.04956445
## 8: 8 16 4 23.9150 2.440 2.80 0.06933887 0.000000000 0.10400831
## 9: 8 16 5 24.0045 2.465 3.00 0.03602739 0.058286872 0.21616433
## 10: 8 16 6 23.9570 2.540 3.25 0.04666109 0.015553698 0.06221479
setDT(df) converts your df to a data.table by reference (no copy is made). The by=.(...) clause defines the groups, and the .(...) clause does the aggregation.
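For the full data set with 8 species, the aggregation list can be built from column names rather than typed out. The following is a rough sketch under stated assumptions, not a tested solution for the real data: mean_cols and sum_cols are hypothetical helper vectors, species columns are assumed to start with "d.", and Depth is averaged to match the expected row in the question. The zp table is a three-row stand-in taken from the posted sample.

```r
library(data.table)

# Stand-in data: one replicated site (E + S tows) and one 'S'-only site.
zp <- data.table(
  month   = c("11", "11", "7"),
  day     = c("4", "4", "27"),
  Site    = c(11L, 11L, 2L),
  Tow     = c("E", "S", "S"),
  Depth   = c(10L, 11L, 14L),
  Temp    = c(12.75, 12.9, 24.41),
  DO_mgL  = c(10.54, 10.45, 9.14),
  secchi  = c(1.25, 1.25, 2.75),
  d.lept  = c(0, 0, 0.053921767),
  d.byths = c(0, 0, 0),
  d.daph  = c(0.140036552, 1.010093452, 0.024509894)
)

# Hypothetical helpers: which columns to average and which to sum.
mean_cols <- c("Depth", "Temp", "DO_mgL")
sum_cols  <- grep("^d\\.", names(zp), value = TRUE)  # assumption: "d." prefix

combined <- zp[, c(
  lapply(.SD[, mean_cols, with = FALSE], mean),  # averages per group
  list(secchi = secchi[1L]),                     # same in both tows, keep one
  lapply(.SD[, sum_cols, with = FALSE], sum)     # summed densities per group
), by = .(month, day, Site), .SDcols = c(mean_cols, "secchi", sum_cols)]

print(combined)
```

Tow is not listed in by or .SDcols, so it does not appear in the result, and groups with a single tow keep their original values.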