Find the duration of overlap between pairs of timestamps using data.table
Similar to this question, I would like to find the duration of overlap between pairs of timestamps using data.table.
This is my current code:
library(data.table)
library(dplyr)  # provides %>% and mutate(), used in the last step

DT <- fread(
"stage,ID,date1,date2
1,A,2018-04-17 00:00:00,2018-04-17 01:00:00
1,B,2018-04-17 00:00:00,2018-04-17 00:20:00
1,C,2018-04-17 00:15:00,2018-04-17 01:00:00
2,B,2018-04-17 00:30:00,2018-04-17 01:10:00
2,D,2018-04-17 00:30:00,2018-04-17 00:50:00",
  sep = ","
)

# convert both timestamp columns to POSIXct
cols <- c("date1", "date2")
DT[, (cols) := lapply(.SD, as.POSIXct), .SDcols = cols]

# per stage, split the timeline at every distinct start/end timestamp
breaks <- DT[, {
  tmp <- unique(sort(c(date1, date2)))
  .(start = head(tmp, -1L), end = tail(tmp, -1L))
}, by = stage]

# non-equi join: for each interval, collect the IDs that are active throughout it
result <- DT[breaks, on = .(stage, date1 <= start, date2 >= end), paste(ID, collapse = "+"),
             by = .EACHI, allow.cartesian = TRUE] %>%
  mutate(lengthinseconds = as.numeric(difftime(date2, date1, units = "secs")))
Which returns:
  stage               date1               date2    V1 lengthinseconds
1     1 2018-04-17 00:00:00 2018-04-17 00:15:00   B+A             900
2     1 2018-04-17 00:15:00 2018-04-17 00:20:00 B+A+C             300
3     1 2018-04-17 00:20:00 2018-04-17 01:00:00   A+C            2400
4     2 2018-04-17 00:30:00 2018-04-17 00:50:00   D+B            1200
5     2 2018-04-17 00:50:00 2018-04-17 01:10:00     B            1200
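But I would like to return overlaps only between user dyads (i.e. intervals with no more than two overlapping users). I can think of several hacky ways to achieve this, such as: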
library(dplyr)
library(tidyr)
result %>%
  filter(nchar(V1) == 3) %>%
  tidyr::separate(V1, c("ID1", "ID2"))
Which returns:
  stage               date1               date2 ID1 ID2 lengthinseconds
1     1 2018-04-17 00:00:00 2018-04-17 00:15:00   B   A             900
2     1 2018-04-17 00:20:00 2018-04-17 01:00:00   A   C            2400
3     2 2018-04-17 00:30:00 2018-04-17 00:50:00   D   B            1200
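But this seems inelegant, especially when dealing with longer ID strings and potentially hundreds of IDs per overlap.
Ideally, I would like to know whether there is a way to modify the original data.table code so that it returns this directly.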
At first glance (neglecting performance considerations), this requires only a minor modification to the OP's code:
result <- DT[breaks, on = .(stage, date1 <= start, date2 >= end),
             if (.N == 2L) paste(ID, collapse = "+"),
             by = .EACHI, allow.cartesian = TRUE]
result
   stage               date1               date2  V1
1:     1 2018-04-17 00:00:00 2018-04-17 00:15:00 B+A
2:     1 2018-04-17 00:20:00 2018-04-17 01:00:00 A+C
3:     2 2018-04-17 00:30:00 2018-04-17 00:50:00 D+B
Result rows are created only for those groups, i.e. time ranges, in which exactly two users are active.
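The filter relies on a general property of grouped `j` expressions: an `if` without an `else` evaluates to `NULL` when the condition is false, and data.table emits no row for a group whose `j` result is `NULL`. A minimal sketch on a made-up table (`toy`, `grp` and `id` are hypothetical names for illustration, not part of the OP's data):

```r
library(data.table)

# Hypothetical toy data: group "x" has 2 rows, "y" has 3, "z" has 1
toy <- data.table(
  grp = c("x", "x", "y", "y", "y", "z"),
  id  = c("A", "B", "A", "B", "C", "A")
)

# `if` without `else` yields NULL whenever .N != 2,
# so groups "y" and "z" contribute no row to the result
pairs <- toy[, if (.N == 2L) paste(id, collapse = "+"), by = grp]
pairs
```

Only group `x` should survive here, with `V1` equal to `"A+B"`.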
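The OP has asked for the two IDs to be shown in separate columns, together with the duration of the overlap. In addition, I suggest sorting the IDs.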
result <- DT[breaks, on = .(stage, date1 <= start, date2 >= end),
             if (.N == 2L) {
               tmp <- sort(ID)
               .(ID1 = tmp[1], ID2 = tmp[2], dur.in.sec = difftime(end, start, units = "secs"))
             },
             by = .EACHI, allow.cartesian = TRUE]
result
   stage               date1               date2 ID1 ID2 dur.in.sec
1:     1 2018-04-17 00:00:00 2018-04-17 00:15:00   A   B   900 secs
2:     1 2018-04-17 00:20:00 2018-04-17 01:00:00   A   C  2400 secs
3:     2 2018-04-17 00:30:00 2018-04-17 00:50:00   B   D  1200 secs
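Note that `dur.in.sec` is a `difftime` object, which is why it prints with its unit. If a plain numeric column is preferred, as in the OP's `lengthinseconds`, wrapping the call in `as.numeric()` should be enough. A small sketch with two hand-picked timestamps (`t1`, `t2`, `d` and `secs` are illustrative names, and the `UTC` time zone is an assumption):

```r
# Two illustrative timestamps, 15 minutes apart (time zone assumed to be UTC)
t1 <- as.POSIXct("2018-04-17 00:00:00", tz = "UTC")
t2 <- as.POSIXct("2018-04-17 00:15:00", tz = "UTC")

d <- difftime(t2, t1, units = "secs")  # difftime object, prints with its unit
secs <- as.numeric(d)                  # plain number of seconds: 900
```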
Another possibility:
DT[breaks, on = .(stage, date1 <= start, date2 >= end)
   ][, if (uniqueN(ID) == 2) .SD, by = .(stage, date1, date2)
     ][, dcast(.SD, stage + date1 + date2 ~ rowid(date1, prefix = 'ID'), value.var = 'ID')
       ][, lengthinseconds := as.numeric(difftime(date2, date1, units = "secs"))][]
which gives:
   stage               date1               date2 ID1 ID2 lengthinseconds
1:     1 2018-04-17 00:00:00 2018-04-17 00:15:00   B   A             900
2:     1 2018-04-17 00:20:00 2018-04-17 01:00:00   A   C            2400
3:     2 2018-04-17 00:30:00 2018-04-17 00:50:00   D   B            1200