Breaking up observations across multiple dates in R

I'm looking for a data.table + lubridate solution to the following problem.

Suppose I have the following dataset:


library(data.table)
library(lubridate)
library(magrittr)

sample <- data.table(start = c("2018-12-22 23:00:00",
                               "2018-12-23 06:00:00",
                               "2018-12-22 06:00:00",
                               "2018-12-23 06:00:00"),
                     end = c("2018-12-23 06:00:00",
                             "2018-12-23 13:00:00",
                             "2018-12-23 12:00:00",
                             "2018-12-24 01:00:00"),
                     store = c("A", "A", "B", "B"),
                     var = 1:4)

sample[, start := ymd_hms(start)]
sample[, end := ymd_hms(end)]

This gives:


> sample
                 start                 end store var
1: 2018-12-22 23:00:00 2018-12-23 06:00:00     A   1
2: 2018-12-23 06:00:00 2018-12-23 13:00:00     A   2
3: 2018-12-22 06:00:00 2018-12-23 12:00:00     B   3
4: 2018-12-23 06:00:00 2018-12-24 01:00:00     B   4

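Note that rows 1, 3 and 4 contain observations that span more than one date. I'd like to break these observations up so that each one covers only a single calendar date. The start and end may also be several days apart, in which case I want one observation per date. For the example above, the data.table should look like this: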

                 start                 end store var
1: 2018-12-22 23:00:00 2018-12-22 23:59:59     A   1
2: 2018-12-23 00:00:00 2018-12-23 06:00:00     A   1
3: 2018-12-23 06:00:00 2018-12-23 13:00:00     A   2
4: 2018-12-22 06:00:00 2018-12-22 23:59:59     B   3
5: 2018-12-23 00:00:00 2018-12-23 12:00:00     B   3
6: 2018-12-23 06:00:00 2018-12-23 23:59:59     B   4
7: 2018-12-24 00:00:00 2018-12-24 01:00:00     B   4

It is important that the var variable stays the same when an observation is split across multiple dates.

Thanks!

# expand the rows: one row per calendar date covered by each observation
sample = sample[sample[, .(date = seq(as.IDate(start), as.IDate(end), 1)), by = var], on = "var"]

# fix the times: keep the original start/end on the first/last date,
# otherwise clamp to 00:00:00 / 23:59:59 of that date
sample[, `:=`(
  start = fifelse(as.IDate(start) == date,
                  ymd_hms(paste(as.Date(start), as.ITime(start))),
                  ymd_hms(paste(date, "00:00:00"))),
  end = fifelse(as.IDate(end) == date,
                ymd_hms(paste(as.Date(end), as.ITime(end))),
                ymd_hms(paste(date, "23:59:59"))),
  date = NULL  # drop the helper column
)]

Output:

                 start                 end store var
1: 2018-12-22 23:00:00 2018-12-22 23:59:59     A   1
2: 2018-12-23 00:00:00 2018-12-23 06:00:00     A   1
3: 2018-12-23 06:00:00 2018-12-23 13:00:00     A   2
4: 2018-12-22 06:00:00 2018-12-22 23:59:59     B   3
5: 2018-12-23 00:00:00 2018-12-23 12:00:00     B   3
6: 2018-12-23 06:00:00 2018-12-23 23:59:59     B   4
7: 2018-12-24 00:00:00 2018-12-24 01:00:00     B   4

Using a simple helper function,

library(lubridate)
func <- function(st, en) { 
  days <- seq(floor_date(min(st), unit = "days"), 
              ceiling_date(max(en), unit = "days"),
              by = "1 day")
  days <- c(st, days[-c(1, length(days))], en)
  list(days[-length(days)], days[-1])
}

we get:

library(data.table)
sample[, setNames(func(start, end), c("start", "end")), by = .(store, var)]
#     store   var               start                 end
#    <char> <int>              <POSc>              <POSc>
# 1:      A     1 2018-12-22 23:00:00 2018-12-23 00:00:00
# 2:      A     1 2018-12-23 00:00:00 2018-12-23 06:00:00
# 3:      A     2 2018-12-23 06:00:00 2018-12-23 13:00:00
# 4:      B     3 2018-12-22 06:00:00 2018-12-23 00:00:00
# 5:      B     3 2018-12-23 00:00:00 2018-12-23 12:00:00
# 6:      B     4 2018-12-23 06:00:00 2018-12-24 00:00:00
# 7:      B     4 2018-12-24 00:00:00 2018-12-24 01:00:00
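
Note that the intervals above end at midnight of the following day, whereas the requested output ends each split piece at 23:59:59. The following is a minimal sketch of one way to get the 23:59:59 form. It assumes whole-second timestamps in UTC and that each (store, var) group holds exactly one row; `split_by_day` and `dt` are hypothetical names chosen for illustration, not part of data.table or lubridate.

```r
library(data.table)
library(lubridate)

# Same data as in the question; named `dt` here to avoid masking base::sample
dt <- data.table(
  start = ymd_hms(c("2018-12-22 23:00:00", "2018-12-23 06:00:00",
                    "2018-12-22 06:00:00", "2018-12-23 06:00:00")),
  end   = ymd_hms(c("2018-12-23 06:00:00", "2018-12-23 13:00:00",
                    "2018-12-23 12:00:00", "2018-12-24 01:00:00")),
  store = c("A", "A", "B", "B"),
  var   = 1:4
)

# Hypothetical helper: split a single interval [st, en] at every midnight it crosses.
# Assumes st and en are length-1 POSIXct values with st <= en.
split_by_day <- function(st, en) {
  # first midnight strictly after the start of st's calendar day
  first_cut <- floor_date(st, unit = "day") + days(1)
  # all midnights between st and en; zero-length POSIXct if none
  cuts <- if (first_cut <= en) seq(first_cut, en, by = "1 day") else st[0]
  # each piece starts at st or at a midnight, and ends one second
  # before the next midnight or at en
  list(start = c(st, cuts), end = c(cuts - 1, en))
}

res <- dt[, split_by_day(start, end), by = .(store, var)]
print(res)
```

If an observation ends exactly at midnight, this sketch emits a final zero-length piece at 00:00:00 on that date; whether that row should be kept depends on how the data are used downstream.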