Create a conditional interval/group to capture the sum of a numeric variable

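I need to know how much it rained during specific time periods, using a categorical variable and conditions. Compared with the example data frame I provide, my real data frame has 150+ rows and far more variability in the "binned" (datetime) column. I'm looking for a for loop or function that gets the sum of precipitation ("TotalRain") over specific, variable-length intervals or time periods, based on conditions in the "position" column. This "position" column contains blank rows, "Start" rows, or "End" rows. My idea is to create intervals/groups in the "Groups" column to identify these intervals more easily, then sum by those groups (or with another for loop) and paste the precipitation total for each interval into the "GroupRain" column.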

Here is an example dataset, with the "Groups" and "GroupRain" columns left to be filled in:

dat<-data.frame(binned=as.POSIXct(c("2020-08-01 06:26:00", "2020-08-01 19:26:00", "2020-08-02 06:26:00", "2020-08-02 19:26:00", "2020-08-03 06:26:00","2020-08-03 19:26:00", "2020-08-04 06:26:00", "2020-08-04 19:26:00", "2020-08-05 06:26:00", "2020-08-05 19:26:00", "2020-08-06 06:26:00", "2020-08-06 19:26:00", "2020-08-07 06:26:00", "2020-08-07 19:26:00", "2020-08-08 06:26:00", "2020-08-08 19:26:00", "2020-08-09 06:26:00", "2020-08-09 19:26:00", "2020-08-10 06:26:00"), tz="America/Chicago"), position=c("","", "", "", "", "Start", "", "", "", "End", "", "", "", "Start", "End", "", "" ,"", "" ), TotalRain= as.numeric(c("0.0", "0.0", "0.1", "0.0", "0.0", "0.0", "0.2", "0.3", "0.0", "0.1", "0.0", "0.3", "0.0", "0.1", "0.0", "0.0", "0.4", "0.0", "0.0")), Groups=as.character(""), GroupRain=as.numeric(""))

[Image: the data frame created by the code above]

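The problem I'm running into is writing a for loop or function that sums the "TotalRain" rows until a condition occurs, or while a condition holds. For example, the first 5 rows of my data frame would be my first interval/group; I want the sum of "TotalRain" to stop when the "position" column equals "Start", and then paste that total into the "GroupRain" column on the row where the first interval ends (row 5). The first "Start" in this data frame is the beginning of my second interval/group, which is also 5 rows long. For this second interval/group I again want the sum of "TotalRain", but stopping when "position" equals "End", with the sum pasted into that row. The third interval starts where position is no longer "End" and is blank (rows 11 to 13) and continues until position equals "Start" again (row 14), with its sum pasted into row 13. The 4th interval/group is 2 rows long (rows 14 and 15), with its sum pasted into row 15. The 5th and final interval/group is 4 rows long (rows 16 to 19), with its sum pasted into row 19.

Below is an example of the output data frame, where I created the 5 intervals/groups by hand using letters of the alphabet, summed manually, and pasted each interval's "TotalRain" total into the "GroupRain" column. I could use R's built-in "LETTERS" constant to supply my interval/group names (A, B, C, etc.). The data frame I want to end up with is: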

dat2<-data.frame(binned=as.POSIXct(c("2020-08-01 06:26:00", "2020-08-01 19:26:00", "2020-08-02 06:26:00", "2020-08-02 19:26:00", "2020-08-03 06:26:00","2020-08-03 19:26:00", "2020-08-04 06:26:00", "2020-08-04 19:26:00", "2020-08-05 06:26:00", "2020-08-05 19:26:00", "2020-08-06 06:26:00", "2020-08-06 19:26:00", "2020-08-07 06:26:00", "2020-08-07 19:26:00", "2020-08-08 06:26:00", "2020-08-08 19:26:00", "2020-08-09 06:26:00", "2020-08-09 19:26:00", "2020-08-10 06:26:00"), tz="America/Chicago"), position=c("","", "", "", "", "Start", "", "", "", "End", "", "", "", "Start", "End", "", "" ,"", "" ), TotalRain= as.numeric(c("0.0", "0.0", "0.1", "0.0", "0.0", "0.0", "0.2", "0.3", "0.0", "0.1", "0.0", "0.3", "0.0", "0.1", "0.0", "0.0", "0.4", "0.0", "0.0")), Groups=c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "D", "D", "E", "E", "E", "E"), GroupRain=as.numeric(c("", "", "", "", "0.1", "", "", "", "", "0.6", "0.0", "", "0.3", "", "0.1", "", "", "", "0.4")))

Group A, my first interval, totals 0.1 inches. Group B, my second interval, totals 0.6 inches. Group C totals 0.3 inches. Group D totals 0.1 inches. Group E totals 0.4 inches.

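My end goal is to be able to filter the data frame without losing the intermediate numeric data (such as rain) that I need to keep. When I filter on the "Start" and "End" positions, I want the "TotalRain" values for that interval to already be summarized in "GroupRain". For example, my final code would be something like this; its output shows that interval/group B had 0.6 inches of rain in total and interval/group D had 0.1 inches.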

library(dplyr)
dat3 <- dat2 %>% filter(position == "Start" | position == "End")

Base R

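# Key 1: number of "End" rows at or after each row, so rows leading up to the same "End" share a value
dat$ends <- rev(cumsum(rev(dat$position == "End")))
# Key 2: within each 'ends' block, running count of "Start" rows (stored as character because position is character)
dat$withstarts <- with(dat, ave(position, ends, FUN = function(z) cumsum(z == "Start")))
# Sum TotalRain per (ends, withstarts) group and place the total on the group's last row
dat$GroupRain <- with(dat, ave(TotalRain, list(ends, withstarts), FUN = function(z) c(rep(NA, length(z)-1), sum(z)), drop=TRUE))
# Flag the groups that do not contain both "Start" and "End"
with(dat, ave(position, list(ends, withstarts), FUN = function(z) !all(c("Start","End") %in% z)))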
#  [1] "TRUE"  "TRUE"  "TRUE"  "TRUE"  "TRUE"  "FALSE" "FALSE" "FALSE" "FALSE" "FALSE" "TRUE" 
# [12] "TRUE"  "TRUE"  "FALSE" "FALSE" "TRUE"  "TRUE"  "TRUE"  "TRUE" 
dat$GroupRain[with(dat, ave(position, list(ends, withstarts), FUN = function(z) !all(c("Start","End") %in% z))) == "TRUE"] <- NA
dat[c("ends","withstarts")] <- NULL
dat
#                 binned position TotalRain Groups GroupRain
# 1  2020-08-01 06:26:00                0.0               NA
# 2  2020-08-01 19:26:00                0.0               NA
# 3  2020-08-02 06:26:00                0.1               NA
# 4  2020-08-02 19:26:00                0.0               NA
# 5  2020-08-03 06:26:00                0.0               NA
# 6  2020-08-03 19:26:00    Start       0.0               NA
# 7  2020-08-04 06:26:00                0.2               NA
# 8  2020-08-04 19:26:00                0.3               NA
# 9  2020-08-05 06:26:00                0.0               NA
# 10 2020-08-05 19:26:00      End       0.1              0.6
# 11 2020-08-06 06:26:00                0.0               NA
# 12 2020-08-06 19:26:00                0.3               NA
# 13 2020-08-07 06:26:00                0.0               NA
# 14 2020-08-07 19:26:00    Start       0.1               NA
# 15 2020-08-08 06:26:00      End       0.0              0.1
# 16 2020-08-08 19:26:00                0.0               NA
# 17 2020-08-09 06:26:00                0.4               NA
# 18 2020-08-09 19:26:00                0.0               NA
# 19 2020-08-10 06:26:00                0.0               NA
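
To see why the reversed cumulative sum works as a grouping key, here is a minimal sketch on a toy vector. The names pos, ends and starts are made up for this illustration and are not part of the question's data. Counting the "End" values at or after each row gives every row leading up to (and including) an "End" the same number, and the running count of "Start" within each of those blocks then separates the rows from "Start" onward.

```r
# Toy position vector (hypothetical, not the question's data)
pos <- c("", "Start", "", "End", "", "Start", "End", "")

# Number of "End" values at or after each row: rows that lead up to the
# same "End" get the same key, and the "End" row itself is included.
ends <- rev(cumsum(rev(pos == "End")))
ends
# [1] 2 2 2 2 1 1 1 0

# Within each 'ends' block, count the "Start" values seen so far.
starts <- ave(as.integer(pos == "Start"), ends, FUN = cumsum)
starts
# [1] 0 1 1 1 0 1 1 0

# Rows with starts == 1 run from a "Start" through its matching "End".
data.frame(pos, ends, starts)
```

A plain cumsum(pos == "End") would not do here, because the "End" row would be counted into the following block instead of closing its own.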

Alternatively, starting from dat with the ends and withstarts columns added as above (that is, before they are removed):

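# Total rain per (ends, withstarts) group
res <- aggregate(TotalRain ~ ends + withstarts, data = dat, FUN = sum)
names(res)[3] <- "GroupRain"
# Drop the existing GroupRain column first; otherwise merge() returns GroupRain.x and GroupRain.y
dat2 <- merge(dat[, names(dat) != "GroupRain"], res, by = c("ends", "withstarts"))
# merge() sorts by the key columns, so the original row order is not preserved
dat2$GroupRain[dat2$position != "End"] <- NA
dat2[,c("ends","withstarts")] <- NULL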
dat2
#                 binned position TotalRain Groups GroupRain
# 1  2020-08-08 19:26:00                0.0               NA
# 2  2020-08-09 06:26:00                0.4               NA
# 3  2020-08-09 19:26:00                0.0               NA
# 4  2020-08-10 06:26:00                0.0               NA
# 5  2020-08-06 06:26:00                0.0               NA
# 6  2020-08-06 19:26:00                0.3               NA
# 7  2020-08-07 06:26:00                0.0               NA
# 8  2020-08-07 19:26:00    Start       0.1               NA
# 9  2020-08-08 06:26:00      End       0.0              0.1
# 10 2020-08-01 06:26:00                0.0               NA
# 11 2020-08-01 19:26:00                0.0               NA
# 12 2020-08-02 06:26:00                0.1               NA
# 13 2020-08-02 19:26:00                0.0               NA
# 14 2020-08-03 06:26:00                0.0               NA
# 15 2020-08-03 19:26:00    Start       0.0               NA
# 16 2020-08-04 06:26:00                0.2               NA
# 17 2020-08-04 19:26:00                0.3               NA
# 18 2020-08-05 06:26:00                0.0               NA
# 19 2020-08-05 19:26:00      End       0.1              0.6

dplyr

library(dplyr)
dat %>%
  group_by(ends = rev(cumsum(rev(position == "End")))) %>%
  group_by(withstarts = cumsum(position == "Start"), .add = TRUE) %>%
  mutate(GroupRain = if_else(all(c("Start", "End") %in% position) & row_number() == n(), sum(TotalRain), NA_real_)) %>%
  ungroup() %>%
  select(-ends, -withstarts)
# # A tibble: 19 x 5
#    binned              position TotalRain Groups GroupRain
#    <dttm>              <chr>        <dbl> <chr>      <dbl>
#  1 2020-08-01 06:26:00 ""             0   ""          NA  
#  2 2020-08-01 19:26:00 ""             0   ""          NA  
#  3 2020-08-02 06:26:00 ""             0.1 ""          NA  
#  4 2020-08-02 19:26:00 ""             0   ""          NA  
#  5 2020-08-03 06:26:00 ""             0   ""          NA  
#  6 2020-08-03 19:26:00 "Start"        0   ""          NA  
#  7 2020-08-04 06:26:00 ""             0.2 ""          NA  
#  8 2020-08-04 19:26:00 ""             0.3 ""          NA  
#  9 2020-08-05 06:26:00 ""             0   ""          NA  
# 10 2020-08-05 19:26:00 "End"          0.1 ""           0.6
# 11 2020-08-06 06:26:00 ""             0   ""          NA  
# 12 2020-08-06 19:26:00 ""             0.3 ""          NA  
# 13 2020-08-07 06:26:00 ""             0   ""          NA  
# 14 2020-08-07 19:26:00 "Start"        0.1 ""          NA  
# 15 2020-08-08 06:26:00 "End"          0   ""           0.1
# 16 2020-08-08 19:26:00 ""             0   ""          NA  
# 17 2020-08-09 06:26:00 ""             0.4 ""          NA  
# 18 2020-08-09 19:26:00 ""             0   ""          NA  
# 19 2020-08-10 06:26:00 ""             0   ""          NA  
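
The approaches above only fill GroupRain for the "Start" to "End" intervals. If the lettered groups and a total for every interval (A through E in the question) are also wanted, one possible sketch in base R follows. It assumes a new group begins at the first row, at every "Start" row, and at the row right after every "End"; the names ex, new_group, grp and is_last are made up for this illustration, and only the position and TotalRain columns are rebuilt here to keep the block self-contained.

```r
# Rebuild just the two columns needed from the question's example data
ex <- data.frame(
  position  = c("", "", "", "", "", "Start", "", "", "", "End",
                "", "", "", "Start", "End", "", "", "", ""),
  TotalRain = c(0, 0, 0.1, 0, 0, 0, 0.2, 0.3, 0, 0.1,
                0, 0.3, 0, 0.1, 0, 0, 0.4, 0, 0)
)
n <- nrow(ex)

# A group starts on a "Start" row or on the row following an "End" row
after_end <- c(FALSE, head(ex$position == "End", -1))
new_group <- ex$position == "Start" | after_end
new_group[1] <- TRUE            # the first row always opens a group

grp <- cumsum(new_group)        # running group number 1, 2, 3, ...
ex$Groups <- LETTERS[grp]       # assumption: at most 26 groups

# Put each group's total on its last row and NA everywhere else
is_last <- c(grp[-1] != grp[-n], TRUE)
ex$GroupRain <- ifelse(is_last, ave(ex$TotalRain, grp, FUN = sum), NA)

ex
```

Filtering afterwards with ex[ex$position %in% c("Start", "End"), ] keeps the totals for groups B and D on their "End" rows, which is the final view the question describes. With more than 26 groups LETTERS[grp] returns NA, so a different labelling scheme would be needed for longer data.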

Data

dat <- structure(list(binned = structure(c(1596281160, 1596327960, 1596367560, 1596414360, 1596453960, 1596500760, 1596540360, 1596587160, 1596626760, 1596673560, 1596713160, 1596759960, 1596799560, 1596846360, 1596885960, 1596932760, 1596972360, 1597019160, 1597058760), class = c("POSIXct", "POSIXt"), tzone = "America/Chicago"), position = c("", "", "", "", "", "Start", "", "", "", "End", "", "", "", "Start", "End", "", "", "", ""), TotalRain = c(0, 0, 0.1, 0, 0, 0, 0.2, 0.3, 0, 0.1, 0, 0.3, 0, 0.1, 0,  0, 0.4, 0, 0), Groups = c("", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""), GroupRain = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_)), row.names = c(NA, -19L), class = "data.frame")