R: Sum values between two dates without repetition

Question

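I have the following dataset:

Data:

I want to sum, by group, the values of all events that start within 60 days (the intervals are already computed as start_interval and end_interval), without adding the same row to more than one interval.

Expected output:

I did some research and found some solutions. So far I have obtained the result shown below, which is very close to what I expect. For example:

The difference from that question is that I need the sum by group without the same row being added in multiple periods.

The result I have obtained so far:

Does anyone have a suggestion?

Reproducible example: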

library(data.table)
library(dplyr)

# Input data
data <- data.table(id = c("Group A", "Group A", "Group A", "Group A",
                          "Group A", "Group A"),
                   start_date_event = c("2019-09-15",
                                        "2019-11-24",
                                        "2020-04-19",
                                        "2020-04-25",
                                        "2020-05-25",
                                        "2020-10-27"),
                     end_date_event = c("2019-09-24",
                                        "2019-11-28",
                                        "2020-04-23",
                                        "2020-04-29",
                                        "2020-05-27",
                                        "2020-11-06"),
                     start_interval = c("2019-09-15",
                                        "2019-11-24",
                                        "2020-04-19",
                                        "2020-04-25",
                                        "2020-05-25",
                                        "2020-10-27"),
                       end_interval = c("2019-11-14",
                                        "2020-01-23",
                                        "2020-06-18",
                                        "2020-06-24",
                                        "2020-07-24",
                                        "2020-12-26"),
                     value = c(9, 4, 4, 4, 2, 15))

# Convert to date
data <- data %>%
        dplyr::mutate(start_date_event = as.Date(start_date_event)) %>%
        dplyr::mutate(end_date_event = as.Date(end_date_event)) %>%
        dplyr::mutate(start_interval = as.Date(start_interval)) %>%
        dplyr::mutate(end_interval = as.Date(end_interval))

# Calculating with non-equi join
temp <- data[data,
          on = .(start_date_event <= end_interval,
                 end_date_event >= start_interval)][,
          .(value_sum = sum(value)),
          by = .(id, start_date_event)]

# Get all
data <- merge(data, temp, all.x = TRUE,
              by.x = c("id", "end_interval"),
              by.y = c("id", "start_date_event"))

Thanks!

Answer 1

Here is a seemingly convoluted way to get your result:

data[, rn := seq_len(.N)
  ][data, on = .(id, start_date_event >= start_interval, end_date_event <= end_interval)
  ][, z := fifelse(rleid(i.rn) > 1, 0, value), by = rn
  ][, value_sum := sum(z), by = i.rn
  ][, .SD[1,], .SDcols = patterns("^.[^.]"), by=.(i.rn)
  ][, c("rn", "i.rn") := NULL ]
#         id start_date_event end_date_event start_interval end_interval value value_sum
#     <char>           <Date>         <Date>         <Date>       <Date> <num>     <num>
# 1: Group A       2019-09-15     2019-11-14     2019-09-15   2019-11-14     9         9
# 2: Group A       2019-11-24     2020-01-23     2019-11-24   2020-01-23     4         4
# 3: Group A       2020-04-19     2020-06-18     2020-04-19   2020-06-18     4        10
# 4: Group A       2020-04-25     2020-06-24     2020-04-25   2020-06-24     4         0
# 5: Group A       2020-05-25     2020-07-24     2020-05-25   2020-07-24     2         0
# 6: Group A       2020-10-27     2020-12-26     2020-10-27   2020-12-26    15        15
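
The chain is dense, so here is a minimal sketch of the same idea unpacked into named steps. It is not the answerer's code: the names events, pairs, counted and sums are hypothetical, and end_interval is rebuilt as start_date_event + 60, which matches the sample data but is an assumption about how the intervals were computed. It also uses the x. and i. prefixes in j to pick columns explicitly, so end_date_event keeps the event's own end date instead of being replaced by the interval bound, as happens in the output above.

```r
library(data.table)

# Sample events from the question, dates already converted.
events <- data.table(
  id               = "Group A",
  start_date_event = as.Date(c("2019-09-15", "2019-11-24", "2020-04-19",
                               "2020-04-25", "2020-05-25", "2020-10-27")),
  end_date_event   = as.Date(c("2019-09-24", "2019-11-28", "2020-04-23",
                               "2020-04-29", "2020-05-27", "2020-11-06")),
  value            = c(9, 4, 4, 4, 2, 15)
)
# Assumption: each interval runs 60 days from the event start (true for the sample).
events[, `:=`(start_interval = start_date_event,
              end_interval   = start_date_event + 60,
              rn             = seq_len(.N))]

# Step 1: pair each interval (i) with every event (x) that lies fully inside it.
pairs <- events[events,
                on = .(id,
                       start_date_event >= start_interval,
                       end_date_event   <= end_interval),
                .(id, interval_rn = i.rn, event_rn = x.rn, value = x.value)]
setorder(pairs, interval_rn, event_rn)

# Step 2: count an event only in the first interval that contains it.
pairs[, counted := interval_rn == min(interval_rn), by = event_rn]

# Step 3: sum the counted values per interval and attach them to the events.
sums   <- pairs[, .(value_sum = sum(value * counted)), by = .(id, interval_rn)]
result <- events[sums, on = .(id, rn = interval_rn)][, !"rn"]
print(result)
```

Splitting the join from the "count once" rule makes it easier to swap in a different rule later, for example counting an event in its last containing interval instead of its first.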

Answer 2

Here is one option..

First, convert the dates (this is like your dplyr/mutate statement above)

data <- cbind(data[, .(id, value)], data[, lapply(.SD, as.Date), .SDcols = c(2,3,4,5)])

Add an event_id column within each group

data[order(id,start_date_event), event_id:=1:.N, id]

Get a table of unique "periods" by id, and key the table for use in foverlaps

periods <- data[, .(id, start_interval, end_interval)][, period:=1:.N, by=id]
setkey(periods, id, start_interval, end_interval)

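Use fast overlaps (foverlaps) to associate each event with its periods, then get the minimum period for each event and the sum of values for each period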

period_id <- foverlaps(data, periods, by.x = c("id", "start_date_event", "end_date_event"))
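
To see what foverlaps returns here, the following minimal sketch rebuilds a trimmed copy of the sample data. The names events, pairs and n_periods are hypothetical, and only the columns needed for the overlap are kept. Each event is matched to every period whose interval it overlaps, so an event can appear on several rows.

```r
library(data.table)

starts   <- as.Date(c("2019-09-15", "2019-11-24", "2020-04-19",
                      "2020-04-25", "2020-05-25", "2020-10-27"))
ends     <- as.Date(c("2019-09-24", "2019-11-28", "2020-04-23",
                      "2020-04-29", "2020-05-27", "2020-11-06"))
int_ends <- as.Date(c("2019-11-14", "2020-01-23", "2020-06-18",
                      "2020-06-24", "2020-07-24", "2020-12-26"))

events  <- data.table(id = "Group A", event_id = 1:6,
                      start_date_event = starts, end_date_event = ends)
periods <- data.table(id = "Group A", period = 1:6,
                      start_interval = starts, end_interval = int_ends)

# foverlaps needs y keyed, with the interval start/end as the last two key columns.
setkey(periods, id, start_interval, end_interval)

# One row per (event, overlapping period) pair.
pairs <- foverlaps(events, periods,
                   by.x = c("id", "start_date_event", "end_date_event"))
print(pairs[, .(event_id, period)])

# How many periods each event overlaps.
n_periods <- pairs[, .N, keyby = event_id]$N
print(n_periods)
```

In the sample, events 4 and 5 overlap more than one period, which is why the later steps keep only the minimum period per event.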

Create the value_sum column with these steps

# Get the value_sums, by merging the minimum period by event
# with the sum over the values by period
value_sums = period_id[,.(period = min(period)),
          by=.(id, event_id)][
            period_id[
              , .(value_sum = sum(value)),
              by = .(id, period)],
            on=.(id, period), nomatch=0]

# convert the value sum column to zero if it is not the first row, by associated period
value_sums[order(id,event_id, period),value_sum:=value_sum*((1:.N)==1), by=.(id, period)]


# merge back on to data (dropping the period column)
data[value_sums[, !c("period")], on=.(id,event_id)]

Output:

        id value start_date_event end_date_event start_interval end_interval event_id value_sum
1: Group A     9       2019-09-15     2019-09-24     2019-09-15   2019-11-14        1         9
2: Group A     4       2019-11-24     2019-11-28     2019-11-24   2020-01-23        2         4
3: Group A     4       2020-04-19     2020-04-23     2020-04-19   2020-06-18        3        10
4: Group A     4       2020-04-25     2020-04-29     2020-04-25   2020-06-24        4         0
5: Group A     2       2020-05-25     2020-05-27     2020-05-25   2020-07-24        5         0
6: Group A    15       2020-10-27     2020-11-06     2020-10-27   2020-12-26        6        15
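
The value_sum * ((1:.N) == 1) step relies on a logical vector being coerced to 0/1 in arithmetic, so every row except the first in each period is multiplied by zero. Below is a minimal sketch on a hand-built toy table. The name dt is hypothetical, and its values are assumed to mirror the intermediate per-event sums for the sample data before zeroing. It ends with a sanity check: if no row is counted twice, value_sum should add up to the same total as value.

```r
library(data.table)

# Hand-built stand-in for the per-event sums before zeroing (assumed values).
dt <- data.table(event_id  = 1:6,
                 period    = c(1, 2, 3, 3, 3, 6),
                 value     = c(9, 4, 4, 4, 2, 15),
                 value_sum = c(9, 4, 10, 10, 10, 15))

# TRUE/FALSE become 1/0, so only the first row of each period keeps its sum.
dt[order(event_id), value_sum := value_sum * (seq_len(.N) == 1L), by = period]
print(dt$value_sum)

# Each value is counted exactly once, so both columns have the same total.
stopifnot(sum(dt$value_sum) == sum(dt$value))
```

seq_len(.N) is used instead of 1:.N only as a defensive habit; within a by group .N is at least 1, so both behave the same here.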
