Matching two data frames based on a common ID and the closest timestamp (less than 5 min) in R

Question

I have two datasets that I want to merge. The first one, dat1, looks like this:

ID	timestamp
1	2020-10-26 06:37:23
2	2020-10-26 07:43:16
3	2020-10-26 09:36:52

The second dataset, dat2, looks like this:

ID	timestamp	x1	x2
1	2020-10-26 07:55:23	a	c
1	2020-10-26 06:39:23	b	b
1	2020-10-26 08:28:39	c	e
2	2020-10-26 10:56:12	d	a
2	2020-10-26 18:39:52	e	e
3	2020-10-26 09:37:52	a	a
3	2020-10-26 10:16:17	b	f
3	2020-10-27 07:54:45	c	d
3	2020-10-27 08:25:44	d	a

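The final output should match rows on the common ID and the closest timestamp, but only when the time difference is less than 5 minutes.

I looked at some similar answers that recommend the data.table package, and even one that uses the apply functions in base R. However, the matches came out very odd, and the actual differences between the matched timestamps were far too large.

The final output should look like this: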

ID	timestamp	timestamp.y	x1	x2
1	2020-10-26 06:37:23	2020-10-26 06:39:23	b	b
2	2020-10-26 07:43:16	NA	NA	NA
3	2020-10-26 09:36:52	2020-10-26 09:37:52	a	a

Can someone help me with this? The actual dataset is large.

Answer 1

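Using the data frames shown reproducibly in the Note at the end, perform a left join on the condition indicated in the query, group by the rows of dat1, and take the minimum of the difference in seconds over the matched rows. Then drop the seconds-difference column (column 4). Note that 5 minutes is 60 * 5 seconds.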

library(sqldf)

out <- sqldf("select a.ID, 
              a.timestamp, 
              b.timestamp [timestamp.y], 
              min(abs(a.timestamp - b.timestamp)) seconds, 
              b.x1, 
              b.x2
  from dat1 a
  left join dat2 b on a.ID = b.ID and
                      abs(a.timestamp - b.timestamp) < 60 * 5
  group by a.rowid")[-4]
out$timestamp.y <- as.POSIXct(out$timestamp.y, origin = "1970-01-01")

# check
all.equal(out, target)
## [1] TRUE
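
For readers who prefer to avoid SQL, the same "nearest row within a tolerance" idea can be sketched in base R. This is only an illustrative sketch, not the method used above: nearest_within() is a hypothetical helper name, the timestamps are parsed in UTC for reproducibility (an assumption), and it assumes there are no missing timestamps. It loops over the rows of dat1, so on very large inputs a keyed or rolling join is likely to be faster.

```r
# Small copies of the example inputs; tz = "UTC" is an assumption
dat1 <- data.frame(
  ID = 1:3,
  timestamp = as.POSIXct(c("2020-10-26 06:37:23", "2020-10-26 07:43:16",
                           "2020-10-26 09:36:52"), tz = "UTC")
)
dat2 <- data.frame(
  ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
  timestamp = as.POSIXct(c("2020-10-26 07:55:23", "2020-10-26 06:39:23",
                           "2020-10-26 08:28:39", "2020-10-26 10:56:12",
                           "2020-10-26 18:39:52", "2020-10-26 09:37:52",
                           "2020-10-26 10:16:17", "2020-10-27 07:54:45",
                           "2020-10-27 08:25:44"), tz = "UTC"),
  x1 = c("a", "b", "c", "d", "e", "a", "b", "c", "d"),
  x2 = c("c", "b", "e", "a", "e", "a", "f", "d", "a"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each row of `left`, find the row of `right`
# with the same ID and the smallest absolute time gap, and keep it only
# if that gap is below `tol_secs`.
nearest_within <- function(left, right, tol_secs = 60 * 5) {
  idx <- vapply(seq_len(nrow(left)), function(i) {
    cand <- which(right$ID == left$ID[i])          # rows sharing the ID
    if (length(cand) == 0L) return(NA_integer_)
    gap <- abs(as.numeric(difftime(right$timestamp[cand],
                                   left$timestamp[i], units = "secs")))
    best <- which.min(gap)                         # closest candidate
    if (gap[best] < tol_secs) cand[best] else NA_integer_
  }, integer(1))

  # Indexing with NA yields NA, so unmatched rows are filled with NA
  data.frame(left,
             timestamp.y = right$timestamp[idx],
             x1 = right$x1[idx],
             x2 = right$x2[idx],
             stringsAsFactors = FALSE)
}

res <- nearest_within(dat1, dat2)
print(res)
```

The tolerance is passed in seconds, so a different window (say, 10 minutes) would be nearest_within(dat1, dat2, tol_secs = 60 * 10).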

Note

The inputs and the target are assumed to be as shown reproducibly below. Note that the timestamp columns have POSIXct class.

dat1 <-
structure(list(ID = 1:3, timestamp = structure(c(1603708643, 
1603712596, 1603719412), class = c("POSIXct", "POSIXt"), tzone = "")), 
row.names = c(NA, -3L), class = "data.frame")

dat2 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L), 
timestamp = structure(c(1603713323, 1603708763, 1603715319, 
1603724172, 1603751992, 1603719472, 1603721777, 
1603799685, 1603801544), class = c("POSIXct", "POSIXt"), tzone = ""), 
    x1 = c("a", "b", "c", "d", "e", "a", "b", "c", "d"), x2 = c("c", 
    "b", "e", "a", "e", "a", "f", "d", "a")), row.names = c(NA, 
-9L), class = "data.frame")

target <-
structure(list(ID = 1:3, timestamp = structure(c(1603708643, 
1603712596, 1603719412), class = c("POSIXct", "POSIXt")), 
timestamp.y = structure(c(1603708763, 
NA, 1603719472), class = c("POSIXct", "POSIXt")), 
    x1 = c("b", NA, "a"), x2 = c("b", NA, "a")), row.names = c(NA, 
-3L), class = "data.frame")
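
The POSIXct class matters because such values are stored as seconds since 1970-01-01, which is what makes the comparison against 60 * 5 in the query meaningful. A minimal sketch of that arithmetic, assuming the timestamps are parsed in UTC (the time zone is an assumption; the data above uses the session's local zone):

```r
# Parse two of the example timestamps; tz = "UTC" is an assumption
t1 <- as.POSIXct("2020-10-26 06:37:23", tz = "UTC")
t2 <- as.POSIXct("2020-10-26 06:39:23", tz = "UTC")

# POSIXct values are numbers of seconds underneath, so subtracting the
# numeric values gives the gap in seconds
gap_secs <- as.numeric(t2) - as.numeric(t1)

# The 5-minute rule is then a plain numeric comparison
within_five_min <- gap_secs < 60 * 5
print(c(gap_secs = gap_secs, within = within_five_min))
```

If the timestamps arrive as character strings, converting them with as.POSIXct() before joining is what allows this kind of numeric comparison.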
