R - Rolling sum based on dates, with a condition per group

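I have the following dataset.

For each row, starting from its "start_date_event", I have summed all the days that fall within a 60-day range (variable sum_days).

There is a condition, however: only sums greater than 15 days should be taken into account. So, for an event whose sum exceeds 15 days, I want to assign "0" to all rows that fall within the corresponding period.

Expected output: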

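Example of the expected result: row 2 becomes 0 because it falls within the range of the previous row, whose sum is greater than 15 days. The event in row 2 starts on 2019-02-28, which lies between 2019-01-01 (event start) and 2019-03-06 (end of the 60-day interval, i.e. end_interval = 2019-01-05 + 60) of the first row, whose sum is greater than 15.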

Does anyone have any suggestions?

Reproducible example:

library(data.table)
library(dplyr)

# Input data
data <- data.table(id = c("Group A", "Group A", "Group A", "Group A",
                          "Group B", "Group B"),
                   start_date_event = c("2019-01-01",
                                        "2019-02-28",
                                        "2019-03-13",
                                        "2019-03-19",
                                        "2020-04-02",
                                        "2020-05-15"),
                   end_date_event = c("2019-01-05",
                                      "2019-03-12",
                                      "2019-03-18",
                                      "2019-03-20",
                                      "2020-05-06",
                                      "2020-05-16"))

# Convert to Date, then derive the event length and the end of the 60-day interval
data <- data %>%
          dplyr::mutate(start_date_event = as.Date(start_date_event),
                        end_date_event = as.Date(end_date_event),
                        days_diff = as.integer(end_date_event - start_date_event),
                        end_interval = end_date_event + 60) %>%
          data.table::setDT()

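# Rolling sum via a non-equi self-join: days_diff is summed over rows of the
# same id whose start_date_event lies within 60 days after a row's start_date_event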
data[.(c = id, tmin = start_date_event,
       tmax = start_date_event + 60),
   on = .(id == c, start_date_event <= tmax,
          start_date_event >= tmin),
   sum_days := sum(days_diff), by = .EACHI]
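
To sanity-check sum_days without the join, here is a minimal base R sketch of the same window sum. It assumes the window for each row is [start_date_event, start_date_event + 60] within the same id, which is how I read the join above; `window_sum` and `sum_days_check` are names I made up for this check.

```r
# Sample rows from the question, rebuilt as a plain data.frame
df <- data.frame(
  id = c(rep("Group A", 4), rep("Group B", 2)),
  start_date_event = as.Date(c("2019-01-01", "2019-02-28", "2019-03-13",
                               "2019-03-19", "2020-04-02", "2020-05-15")),
  end_date_event = as.Date(c("2019-01-05", "2019-03-12", "2019-03-18",
                             "2019-03-20", "2020-05-06", "2020-05-16"))
)
df$days_diff <- as.integer(df$end_date_event - df$start_date_event)

# Hypothetical helper: sum days_diff over rows of the same id that start
# within `width` days after row i starts (row i itself is included)
window_sum <- function(i, d, width = 60) {
  in_window <- d$id == d$id[i] &
    d$start_date_event >= d$start_date_event[i] &
    d$start_date_event <= d$start_date_event[i] + width
  sum(d$days_diff[in_window])
}

df$sum_days_check <- vapply(seq_len(nrow(df)), window_sum, integer(1), d = df)
print(df[, c("id", "start_date_event", "days_diff", "sum_days_check")])
```

For the six sample rows this gives 16, 18, 6, 1, 35, 1, so a sum_days column with other values would point to a problem in the join rather than in the data.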

This should work:

library(sqldf)
library(dplyr)
library(data.table)

# Row index; it mirrors SQLite's implicit rowid used in the query below
data$row_n <- seq_len(nrow(data))

# Identifying which lines overlap and then filtering data
data <- sqldf("select a.*, 
                      coalesce(group_concat(b.rowid), '') as overlaps
               from data a
               left join data b on a.id = b.id and 
                                   not a.rowid = b.rowid and
                                   ((a.start_date_event between
                                     b.start_date_event and b.end_interval) or
                                    (b.start_date_event between a.start_date_event
                                     and a.end_interval))
               group by a.rowid
               order by a.rowid") %>%
               group_by(id) %>%
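               mutate(row_n = as.character(row_n),
                      previous_row = dplyr::lag(row_n, n = 1, default = NA),
                      # TRUE when the previous row of the group is listed in 'overlaps';
                      # exact match on the comma-separated ids, so "1" does not match "10"
                      prev_overlaps = mapply(function(p, o) p %in% strsplit(o, ",", fixed = TRUE)[[1]],
                                             previous_row, overlaps),
                      # First pass: zero a row when the overlapping previous row has sum_days > 15
                      previous_value = dplyr::lag(sum_days, n = 1, default = NA),
                      sum2 = case_when(prev_overlaps & previous_value > 15 ~ 0L,
                                       TRUE ~ sum_days),
                      # Second pass: same test against the previous row's first-pass value,
                      # so a row that was itself zeroed no longer zeroes the next one
                      previous_value = dplyr::lag(sum2, n = 1, default = NA),
                      sum2 = case_when(prev_overlaps & previous_value > 15 ~ 0L,
                                       TRUE ~ sum_days)) %>%
               dplyr::select(-c(previous_value, previous_row, prev_overlaps, row_n))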
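
If the SQL step feels heavy, the same rule can probably be written as one ordered pass per group. This is only a sketch under my reading of the question: rows are sorted by start_date_event within id, a kept row with sum_days > 15 blocks every later row that starts on or before its end_interval, and a row that has been zeroed does not open a window of its own. `zero_blocked` is a hypothetical helper, and the sum_days values are hard-coded 60-day window sums for the sample rows.

```r
# Sample rows from the question; sum_days holds the 60-day window sums
df <- data.frame(
  id = c(rep("Group A", 4), rep("Group B", 2)),
  start_date_event = as.Date(c("2019-01-01", "2019-02-28", "2019-03-13",
                               "2019-03-19", "2020-04-02", "2020-05-15")),
  end_date_event = as.Date(c("2019-01-05", "2019-03-12", "2019-03-18",
                             "2019-03-20", "2020-05-06", "2020-05-16")),
  sum_days = c(16L, 18L, 6L, 1L, 35L, 1L)
)
df$end_interval <- df$end_date_event + 60

# Hypothetical helper: walk one group's rows in date order and zero every row
# that starts inside the window of an earlier kept row whose sum > threshold
zero_blocked <- function(start, end_interval, sum_days, threshold = 15L) {
  out <- sum_days
  blocked_until <- as.Date(NA)
  for (i in seq_along(start)) {
    if (!is.na(blocked_until) && start[i] <= blocked_until) {
      out[i] <- 0L                       # inside an earlier qualifying window
    } else if (sum_days[i] > threshold) {
      blocked_until <- end_interval[i]   # this kept row opens a new window
    }
  }
  out
}

# Apply the helper group by group (assumption: one pass over sorted rows)
df <- df[order(df$id, df$start_date_event), ]
df$sum2 <- NA_integer_
for (g in unique(df$id)) {
  rows <- df$id == g
  df$sum2[rows] <- zero_blocked(df$start_date_event[rows],
                                df$end_interval[rows],
                                df$sum_days[rows])
}
print(df[, c("id", "start_date_event", "sum_days", "sum2")])
```

On the sample data sum2 comes out as 16, 0, 6, 1, 35, 0: row 2 is zeroed by row 1, and the second Group B row is zeroed by the first. If a zeroed row should still block later rows, the `else if` branch would need to change, so check this against the expected output before relying on it.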