Find the duration of overlap between pairs of timestamps using data.table

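Similar to this question, I would like to find the duration of overlap between pairs of timestamps, using data.table. This is my current code: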




library(data.table)
library(dplyr)  # provides %>%, mutate() and filter() used below

DT <- fread(
  "stage,ID,date1,date2
  1,A,2018-04-17 00:00:00,2018-04-17 01:00:00
  1,B,2018-04-17 00:00:00,2018-04-17 00:20:00
  1,C,2018-04-17 00:15:00,2018-04-17 01:00:00
  2,B,2018-04-17 00:30:00,2018-04-17 01:10:00
  2,D,2018-04-17 00:30:00,2018-04-17 00:50:00",
  sep = ","
)

cols <- c("date1", "date2")
DT[, (cols) := lapply(.SD, as.POSIXct), .SDcols = cols]

breaks <- DT[, {
  tmp <- unique(sort(c(date1, date2)))
  .(start = head(tmp, -1L), end = tail(tmp, -1L))
}, by = stage]

result <- DT[breaks, on = .(stage, date1 <= start, date2 >= end), paste(ID, collapse = "+"),
    by = .EACHI, allow.cartesian = TRUE] %>%
  mutate(lengthinseconds = as.numeric(difftime(date2, date1, units = "secs")))

which returns:

  stage               date1               date2    V1 lengthinseconds
1     1 2018-04-17 00:00:00 2018-04-17 00:15:00   B+A             900
2     1 2018-04-17 00:15:00 2018-04-17 00:20:00 B+A+C             300
3     1 2018-04-17 00:20:00 2018-04-17 01:00:00   A+C            2400
4     2 2018-04-17 00:30:00 2018-04-17 00:50:00   D+B            1200
5     2 2018-04-17 00:50:00 2018-04-17 01:10:00     B            1200

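But I would like to return only the overlaps between dyads of users (i.e., intervals with exactly two overlapping users). I can think of several hacky ways to achieve this, such as: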


result %>% 
  filter(nchar(V1)==3) %>% 
  tidyr::separate(V1, c("ID1", "ID2"))

which returns:

  stage               date1               date2 ID1 ID2 lengthinseconds
1     1 2018-04-17 00:00:00 2018-04-17 00:15:00   B   A             900
2     1 2018-04-17 00:20:00 2018-04-17 01:00:00   A   C            2400
3     2 2018-04-17 00:30:00 2018-04-17 00:50:00   D   B            1200

But this seems inelegant, especially when dealing with longer ID strings and potentially hundreds of IDs per overlap.
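To illustrate the concern, here is a small sketch with made-up collapsed ID strings (the vector `V1` and its values are hypothetical, not taken from the real data). The `nchar()` test only works when every ID is a single character, whereas counting the pieces after splitting on "+" does not depend on ID length:

```r
# Made-up collapsed ID strings, for illustration only
V1 <- c("B+A", "B+A+C", "AB+C", "user1+user2", "ABC")

# Character-count filter: misses pairs of longer IDs ("AB+C", "user1+user2")
# and wrongly keeps a single three-character ID ("ABC")
by_nchar <- nchar(V1) == 3

# Counting the IDs after splitting on the literal "+" separator
by_count <- lengths(strsplit(V1, "+", fixed = TRUE)) == 2

by_nchar  # TRUE FALSE FALSE FALSE TRUE
by_count  # TRUE FALSE TRUE TRUE FALSE
```

Even the second filter still relies on string handling after the join, which is what I would like to avoid.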

Ideally, I'd like to know whether there is a way to modify the original data.table code so that it returns this result directly.

At first glance (and leaving performance considerations aside), this requires only a minor modification to the OP's code:

result <- DT[breaks, on = .(stage, date1 <= start, date2 >= end), 
             if (.N == 2L) paste(ID, collapse = "+"),  
             by = .EACHI, allow.cartesian = TRUE]
   stage               date1               date2  V1
1:     1 2018-04-17 00:00:00 2018-04-17 00:15:00 B+A
2:     1 2018-04-17 00:20:00 2018-04-17 01:00:00 A+C
3:     2 2018-04-17 00:30:00 2018-04-17 00:50:00 D+B
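
The filter works because an `if` without an `else` returns NULL when its condition is false, and data.table leaves out groups whose j expression evaluates to NULL. A minimal sketch on made-up data (the table `toy` and its columns `g` and `id` are hypothetical and not part of the question):

```r
library(data.table)

# Toy data for illustration: group "a" has 2 rows, "b" has 3, "c" has 1
toy <- data.table(
  g  = c("a", "a", "b", "b", "b", "c"),
  id = c("X", "Y", "X", "Y", "Z", "X")
)

# j is NULL for groups "b" and "c", so only group "a" appears in the result
pairs <- toy[, if (.N == 2L) paste(id, collapse = "+"), by = g]
pairs
```

Only the group with exactly two rows is kept, which is the same mechanism that drops the three-user and single-user intervals in the join above.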


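The OP has asked for the two IDs to be shown in separate columns, together with the duration of the overlap. In addition, I suggest sorting the IDs.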

result <- DT[breaks, on = .(stage, date1 <= start, date2 >= end), 
   if (.N == 2L) {
     tmp <- sort(ID)
     .(ID1 = tmp[1], ID2 = tmp[2], dur.in.sec = difftime(end, start, units = "secs"))
   },
   by = .EACHI, allow.cartesian = TRUE]
   stage               date1               date2 ID1 ID2 dur.in.sec
1:     1 2018-04-17 00:00:00 2018-04-17 00:15:00   A   B   900 secs
2:     1 2018-04-17 00:20:00 2018-04-17 01:00:00   A   C  2400 secs
3:     2 2018-04-17 00:30:00 2018-04-17 00:50:00   B   D  1200 secs
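
Note that `dur.in.sec` above is a difftime object, which is why it prints with a "secs" suffix. If a plain number is preferred, as in the OP's `lengthinseconds` column, it can be converted with `as.numeric()`. A small sketch using two timestamps from the sample data (the variable names and the explicit UTC time zone are only assumptions for illustration):

```r
# Two break points from stage 1 of the sample data
start <- as.POSIXct("2018-04-17 00:00:00", tz = "UTC")
end   <- as.POSIXct("2018-04-17 00:15:00", tz = "UTC")

d <- difftime(end, start, units = "secs")

secs <- as.numeric(d)                  # plain number of seconds: 900
mins <- as.numeric(d, units = "mins")  # same interval in minutes: 15
```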


DT[breaks, on = .(stage, date1 <= start, date2 >= end)
   ][, if (uniqueN(ID) == 2) .SD, by = .(stage, date1, date2)
     ][, dcast(.SD, stage + date1 + date2 ~ rowid(date1, prefix = 'ID'), value.var = 'ID')
       ][, lengthinseconds := as.numeric(difftime(date2, date1, units = "secs"))][]


   stage               date1               date2 ID1 ID2 lengthinseconds
1:     1 2018-04-17 00:00:00 2018-04-17 00:15:00   B   A             900
2:     1 2018-04-17 00:20:00 2018-04-17 01:00:00   A   C            2400
3:     2 2018-04-17 00:30:00 2018-04-17 00:50:00   D   B            1200