Matching two dataframes based on a common ID and the closest timestamp (less than 5 min) in R
I have two datasets that I want to merge. The first one, dat1, looks like this:
ID | timestamp |
---|---|
1 | 2020-10-26 06:37:23 |
2 | 2020-10-26 07:43:16 |
3 | 2020-10-26 09:36:52 |
The second dataset, dat2, looks like this:
ID | timestamp | x1 | x2 |
---|---|---|---|
1 | 2020-10-26 07:55:23 | a | c |
1 | 2020-10-26 06:39:23 | b | b |
1 | 2020-10-26 08:28:39 | c | e |
2 | 2020-10-26 10:56:12 | d | a |
2 | 2020-10-26 18:39:52 | e | e |
3 | 2020-10-26 09:37:52 | a | a |
3 | 2020-10-26 10:16:17 | b | f |
3 | 2020-10-27 07:54:45 | c | d |
3 | 2020-10-27 08:25:44 | d | a |
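The final output should match rows on the common ID and the closest timestamp, but only when the time difference is less than 5 minutes.
I looked at some similar answers, which suggested the data.table package and even a base R approach using the apply functions. However, the matches were very odd, and the actual differences between the matched timestamps were far too large.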
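The final output should look like this:
ID | timestamp | timestamp.y | x1 | x2 |
---|---|---|---|---|
1 | 2020-10-26 06:37:23 | 2020-10-26 06:39:23 | b | b |
2 | 2020-10-26 07:43:16 | NA | NA | NA |
3 | 2020-10-26 09:36:52 | 2020-10-26 09:37:52 | a | a |
Can someone help me solve this? The actual datasets are large.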
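Using the data frames shown reproducibly in the Note at the end, perform a left join on the indicated condition, group by the rows of dat1, and take the minimum difference in seconds over the matching rows. Drop the seconds-difference column (column 4) at the end. Note that 5 minutes is 60 * 5 seconds.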
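library(sqldf)

# Left join each dat1 row to the dat2 rows that share its ID and lie less than
# 5 minutes (60 * 5 seconds) away. With SQLite (sqldf's default backend),
# grouping by a.rowid together with min() takes the b columns from the row
# that attains the minimum. [-4] drops the seconds column.
out <- sqldf("select a.ID,
    a.timestamp,
    b.timestamp [timestamp.y],
    min(abs(a.timestamp - b.timestamp)) seconds,
    b.x1,
    b.x2
  from dat1 a
  left join dat2 b on a.ID = b.ID and
    abs(a.timestamp - b.timestamp) < 60 * 5
  group by a.rowid")[-4]

# timestamp.y comes back as seconds since the epoch, so convert it to POSIXct
out$timestamp.y <- as.POSIXct(out$timestamp.y, origin = "1970-01-01")

# check
all.equal(out, target)
## [1] TRUE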
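As a cross-check that does not depend on sqldf, the same rule (same ID, closest timestamp, gap under 5 minutes) can be sketched in base R. This is not part of the original answer: the names to_time, nearest_within and max_secs are made up for illustration, and the inputs are rebuilt from the epoch values in the reproducible data with tz = "UTC" assumed. It scans all of dat2 for every dat1 row, so it is meant for verifying results on a small sample rather than for large data.

```r
# Inputs rebuilt from the epoch seconds in the reproducible data
# (tz = "UTC" is an assumption; it does not affect the time differences)
to_time <- function(secs) as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
dat1 <- data.frame(ID = 1:3,
                   timestamp = to_time(c(1603708643, 1603712596, 1603719412)))
dat2 <- data.frame(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
                   timestamp = to_time(c(1603713323, 1603708763, 1603715319,
                                         1603724172, 1603751992, 1603719472,
                                         1603721777, 1603799685, 1603801544)),
                   x1 = c("a", "b", "c", "d", "e", "a", "b", "c", "d"),
                   x2 = c("c", "b", "e", "a", "e", "a", "f", "d", "a"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: for each row of d1, pick the row of d2 with the same ID
# and the smallest absolute time gap, provided that gap is under max_secs.
nearest_within <- function(d1, d2, max_secs = 60 * 5) {
  idx <- vapply(seq_len(nrow(d1)), function(i) {
    cand <- which(d2$ID == d1$ID[i])
    if (length(cand) == 0L) return(NA_integer_)
    gap <- abs(as.numeric(d2$timestamp[cand]) - as.numeric(d1$timestamp[i]))
    best <- which.min(gap)
    if (gap[best] < max_secs) cand[best] else NA_integer_
  }, integer(1))
  # Indexing with NA yields NA, which reproduces the unmatched rows of a left join
  data.frame(ID = d1$ID,
             timestamp = d1$timestamp,
             timestamp.y = d2$timestamp[idx],
             x1 = d2$x1[idx],
             x2 = d2$x2[idx],
             stringsAsFactors = FALSE)
}

res <- nearest_within(dat1, dat2)
res$x1                       # "b" NA "a"
as.numeric(res$timestamp.y)  # 1603708763 NA 1603719472
```

ID 2 gets NA because its closest dat2 row is more than three hours away, which matches the expected output in the question.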
Note
The inputs and target are assumed to be as shown reproducibly below. Note that the timestamp columns have POSIXct class.
dat1 <-
structure(list(ID = 1:3, timestamp = structure(c(1603708643,
1603712596, 1603719412), class = c("POSIXct", "POSIXt"), tzone = "")),
row.names = c(NA, -3L), class = "data.frame")
dat2 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
timestamp = structure(c(1603713323, 1603708763, 1603715319,
1603724172, 1603751992, 1603719472, 1603721777,
1603799685, 1603801544), class = c("POSIXct", "POSIXt"), tzone = ""),
x1 = c("a", "b", "c", "d", "e", "a", "b", "c", "d"), x2 = c("c",
"b", "e", "a", "e", "a", "f", "d", "a")), row.names = c(NA,
-9L), class = "data.frame")
target <-
structure(list(ID = 1:3, timestamp = structure(c(1603708643,
1603712596, 1603719412), class = c("POSIXct", "POSIXt")),
timestamp.y = structure(c(1603708763,
NA, 1603719472), class = c("POSIXct", "POSIXt")),
x1 = c("b", NA, "a"), x2 = c("b", NA, "a")), row.names = c(NA,
-3L), class = "data.frame")
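The POSIXct class matters because such a value is stored as a number of seconds since 1970-01-01 UTC, so a difference of the underlying numbers is a gap in seconds, in line with the 60 * 5 threshold used in the answer. A small sketch with two of the values listed above (the names t1, t2 and gap_secs are only for illustration, and tz = "UTC" is an assumption that does not change the difference):

```r
# Two timestamps from the reproducible data, given as seconds since the epoch
t1 <- as.POSIXct(1603708643, origin = "1970-01-01", tz = "UTC")  # dat1, ID 1
t2 <- as.POSIXct(1603708763, origin = "1970-01-01", tz = "UTC")  # dat2, ID 1, second row

# The underlying numbers differ by the gap in seconds
gap_secs <- abs(as.numeric(t2) - as.numeric(t1))
gap_secs           # 120
gap_secs < 60 * 5  # TRUE: inside the 5-minute window
```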