R - sqldf [Impute the closest value between two datasets with different dates]

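I am merging two datasets, A and B, with sqldf in R (merge key 'id') and need the Weight value from B. The rule: for each visit in A, if B has no record on the same date, take the B record with the closest date. For example: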

  1. DAY 2 in A (2021-03-10) pulls 'Weight' from the B record dated "2021-03-06"
  2. DAY 4 in A (2021-03-26) pulls 'Weight' from the B record dated "2021-03-28"

Data A:

A <- read.table(text = "
    ord, id, Score,DATE, VISIT
    1,001,23,2021-03-01,DAY 0
    2,001,26,2021-03-03,DAY 1
    3,001,45,2021-03-10,DAY 2
    4,001,41,2021-03-20,DAY 3
    5,001,67,2021-03-26,DAY 4", header = TRUE,sep = ",")

Data B:

B <- read.table(text = "
    ord, id, Weight,DATE
    1,001,100,2021-03-01
    2,001,100.5,2021-03-03
    3,001,101,2021-03-06
    4,001,103,2021-03-20
    5,001,102,2021-03-28", header = TRUE,sep = ",")

Expected result:

A_B <- read.table(text = "
    ord, id, Score,DATE, VISIT, Weight
    1,001,23,2021-03-01,DAY 0,100
    2,001,26,2021-03-03,DAY 1,100.5
    3,001,45,2021-03-10,DAY 2,101
    4,001,41,2021-03-20,DAY 3,103
    5,001,67,2021-03-26,DAY 4,102", header = TRUE,sep = ",")

You can use a rolling join with roll = "nearest" from data.table:

library(data.table)

setDT(A)
setDT(B)

A[,DATE:=as.Date(DATE)]
B[,DATE:=as.Date(DATE)]

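# Unprefixed names in j resolve to B first, so take ord and DATE from A via the i. prefix
B[A, .(ord = i.ord, id, Score, DATE = i.DATE, VISIT, Weight), roll = "nearest", on = .(id, DATE)]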
#>    ord id Score       DATE VISIT Weight
#> 1:   1  1    23 2021-03-01 DAY 0  100.0
#> 2:   2  1    26 2021-03-03 DAY 1  100.5
#> 3:   3  1    45 2021-03-10 DAY 2  101.0
#> 4:   4  1    41 2021-03-20 DAY 3  103.0
#> 5:   5  1    67 2021-03-26 DAY 4  102.0
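
To check which B record each visit was matched to, the join can also return B's own date through the x. prefix. The following is a minimal sketch on a cut-down copy of the data, not part of the original answer; the column name B_DATE and the reduced tables are assumptions made for illustration.

```r
library(data.table)

# Cut-down copies of the question's data: only the two visits without an exact match
A <- data.table(id = 1L,
                DATE = as.Date(c("2021-03-10", "2021-03-26")),
                VISIT = c("DAY 2", "DAY 4"))
B <- data.table(id = 1L,
                DATE = as.Date(c("2021-03-06", "2021-03-20", "2021-03-28")),
                Weight = c(101, 103, 102))

# i.DATE is the visit date from A; x.DATE is the date of the B row the roll landed on
# (B_DATE is a hypothetical column name chosen for this sketch)
res <- B[A, .(id, VISIT, DATE = i.DATE, B_DATE = x.DATE, Weight),
         roll = "nearest", on = .(id, DATE)]
print(res)
```

With this data, DAY 2 should pair with the B record of 2021-03-06 (4 days away) and DAY 4 with the record of 2021-03-28 (2 days away).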

If you are not familiar with data.table, I found this tutorial useful for relating its logic to SQL.

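Perform a left join on id, group by the row id of A, and compute the minimum absolute difference between the dates (diff). As a side effect, SQLite fills the bare column b.Weight from the row that attains that minimum. This relies on how SQLite handles bare columns next to a single min() or max() aggregate, so it may not carry over to other database backends.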

library(sqldf)

A$DATE <- as.Date(A$DATE)
B$DATE <- as.Date(B$DATE)

sqldf("select a.*, min(abs(a.DATE - b.DATE)) diff, b.Weight
  from A as a
  left join B as b using(id)
  group by a.rowid")

giving:

  ord id Score       DATE VISIT diff Weight
1   1  1    23 2021-03-01 DAY 0    0  100.0
2   2  1    26 2021-03-03 DAY 1    0  100.5
3   3  1    45 2021-03-10 DAY 2    4  101.0
4   4  1    41 2021-03-20 DAY 3    0  103.0
5   5  1    67 2021-03-26 DAY 4    2  102.0
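
For readers who want to cross-check the result without either package, the same "smallest date gap per A row" logic can be sketched in base R. This is an illustration under assumptions rather than part of either answer: the data frames are re-typed by hand from the question, and ties are resolved by which.min(), which keeps the first (earliest listed) B row.

```r
# Hand-typed copies of the question's data, with dates already parsed
A <- data.frame(id = 1L,
                DATE = as.Date(c("2021-03-01", "2021-03-03", "2021-03-10",
                                 "2021-03-20", "2021-03-26")),
                VISIT = paste("DAY", 0:4))
B <- data.frame(id = 1L,
                DATE = as.Date(c("2021-03-01", "2021-03-03", "2021-03-06",
                                 "2021-03-20", "2021-03-28")),
                Weight = c(100, 100.5, 101, 103, 102))

# For each row of A, look at the B rows with the same id and
# keep the Weight whose date is the fewest days away
A$Weight <- vapply(seq_len(nrow(A)), function(i) {
  cand <- B[B$id == A$id[i], ]
  if (nrow(cand) == 0) return(NA_real_)            # no B record for this id
  gap <- abs(as.numeric(cand$DATE - A$DATE[i]))    # gap in days
  cand$Weight[which.min(gap)]                      # first minimum wins on ties
}, numeric(1))

print(A)
```

This loops over A row by row, so it is meant for verifying small examples; the rolling join or the SQL query is the better fit for larger tables.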