Matching two dataframes based on a common ID and the closest timestamp (less than 5 min) in R

I have two datasets that I want to merge. The first one, dat1, looks like this:

ID timestamp
1 2020-10-26 06:37:23
2 2020-10-26 07:43:16
3 2020-10-26 09:36:52

The second dataset, dat2, looks like this:

ID timestamp x1 x2
1 2020-10-26 07:55:23 a c
1 2020-10-26 06:39:23 b b
1 2020-10-26 08:28:39 c e
2 2020-10-26 10:56:12 d a
2 2020-10-26 18:39:52 e e
3 2020-10-26 09:37:52 a a
3 2020-10-26 10:16:17 b f
3 2020-10-27 07:54:45 c d
3 2020-10-27 08:25:44 d a

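The final output should match rows on the common ID and the closest timestamp, but only when the time difference is less than 5 minutes.

I have looked at some similar answers that recommend the data.table package, and even one that uses the apply functions from base R. However, the matches they produce are odd: the actual differences between the matched timestamps are far too large.

The final output should look like this: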

ID timestamp timestamp.y x1 x2
1 2020-10-26 06:37:23 2020-10-26 06:39:23 b b
2 2020-10-26 07:43:16 NA NA NA
3 2020-10-26 09:36:52 2020-10-26 09:37:52 a a

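Can someone help me with this? The actual dataset is large.

Using the data frames shown reproducibly in the Note at the end, perform a left join on the indicated condition, grouping by dat1 row and taking the minimum seconds difference over the matched rows. Remove the seconds difference column (column 4) at the end. Note that 5 minutes is 60 * 5 seconds.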

library(sqldf)

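# Left join dat2 onto dat1 by ID, keeping only pairs less than 5 minutes apart.
# Grouping by dat1 row with min() makes SQLite take the other b columns
# from the row that has the smallest difference.
out <- sqldf("select a.ID, 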
              a.timestamp, 
              b.timestamp [timestamp.y], 
              min(abs(a.timestamp - b.timestamp)) seconds, 
              b.x1, 
              b.x2
  from dat1 a
  left join dat2 b on a.ID = b.ID and
                      abs(a.timestamp - b.timestamp) < 60 * 5
  group by a.rowid")[-4]
# timestamp.y comes back as numeric seconds since the epoch; convert it to POSIXct
out$timestamp.y <- as.POSIXct(out$timestamp.y, origin = "1970-01-01")

# check
all.equal(out, target)
## [1] TRUE
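
For comparison, the same logic (join on ID, keep pairs under 60 * 5 seconds, take the closest pair per dat1 row) can be sketched in base R without sqldf. This is only an illustrative sketch, not part of the solution above: the helper column row, the names pairs, best and out_base, and the use of tz = "UTC" when rebuilding the sample data are all assumptions made for the example.

```r
# Illustrative copies of the question's data; tz = "UTC" is an assumption.
dat1 <- data.frame(
  ID = 1:3,
  timestamp = as.POSIXct(c("2020-10-26 06:37:23", "2020-10-26 07:43:16",
                           "2020-10-26 09:36:52"), tz = "UTC")
)
dat2 <- data.frame(
  ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
  timestamp = as.POSIXct(c("2020-10-26 07:55:23", "2020-10-26 06:39:23",
                           "2020-10-26 08:28:39", "2020-10-26 10:56:12",
                           "2020-10-26 18:39:52", "2020-10-26 09:37:52",
                           "2020-10-26 10:16:17", "2020-10-27 07:54:45",
                           "2020-10-27 08:25:44"), tz = "UTC"),
  x1 = c("a", "b", "c", "d", "e", "a", "b", "c", "d"),
  x2 = c("c", "b", "e", "a", "e", "a", "f", "d", "a"),
  stringsAsFactors = FALSE
)

# Hypothetical helper column: tag each dat1 row so that repeated IDs
# in dat1 would still be matched row by row.
dat1$row <- seq_len(nrow(dat1))

# All same-ID pairs; the dat2 timestamp gets the ".y" suffix.
pairs <- merge(dat1, dat2, by = "ID", suffixes = c("", ".y"))
pairs$seconds <- abs(as.numeric(difftime(pairs$timestamp, pairs$timestamp.y,
                                         units = "secs")))

# Keep pairs less than 5 minutes apart, then the closest pair per dat1 row.
pairs <- pairs[pairs$seconds < 60 * 5, ]
pairs <- pairs[order(pairs$row, pairs$seconds), ]
best <- pairs[!duplicated(pairs$row), c("row", "timestamp.y", "x1", "x2")]

# Left join back so that unmatched dat1 rows keep NA in the dat2 columns.
out_base <- merge(dat1, best, by = "row", all.x = TRUE)
out_base <- out_base[order(out_base$row),
                     c("ID", "timestamp", "timestamp.y", "x1", "x2")]
```

Because this sketch first builds every same-ID pair before filtering, it may need a lot of memory on a large dataset, so it is probably better suited to checking results on a sample than to replacing the join above.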

Note

We assume the input and target shown reproducibly below. Note that the timestamp columns have POSIXct class.

dat1 <-
structure(list(ID = 1:3, timestamp = structure(c(1603708643, 
1603712596, 1603719412), class = c("POSIXct", "POSIXt"), tzone = "")), 
row.names = c(NA, -3L), class = "data.frame")

dat2 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L), 
timestamp = structure(c(1603713323, 1603708763, 1603715319, 
1603724172, 1603751992, 1603719472, 1603721777, 
1603799685, 1603801544), class = c("POSIXct", "POSIXt"), tzone = ""), 
    x1 = c("a", "b", "c", "d", "e", "a", "b", "c", "d"), x2 = c("c", 
    "b", "e", "a", "e", "a", "f", "d", "a")), row.names = c(NA, 
-9L), class = "data.frame")

target <-
structure(list(ID = 1:3, timestamp = structure(c(1603708643, 
1603712596, 1603719412), class = c("POSIXct", "POSIXt")), 
timestamp.y = structure(c(1603708763, 
NA, 1603719472), class = c("POSIXct", "POSIXt")), 
    x1 = c("b", NA, "a"), x2 = c("b", NA, "a")), row.names = c(NA, 
-3L), class = "data.frame")