R - sqldf [Impute the closest value between two datasets with different dates]
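I'm using sqldf in R to merge two datasets, A and B (merge key 'id'), and I need the Weight value from B.
The rule: for each visit in A, use the B record with the same date; if no date matches, use the B record with the closest date.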
- DAY 2 in A (2021-03-10) takes 'Weight' from the B record dated 2021-03-06
- DAY 4 in A (2021-03-26) takes 'Weight' from the B record dated 2021-03-28
Data A:
A <- read.table(text = "
ord, id, Score,DATE, VISIT
1,001,23,2021-03-01,DAY 0
2,001,26,2021-03-03,DAY 1
3,001,45,2021-03-10,DAY 2
4,001,41,2021-03-20,DAY 3
5,001,67,2021-03-26,DAY 4", header = TRUE,sep = ",")
Data B:
B <- read.table(text = "
ord, id, Weight,DATE
1,001,100,2021-03-01
2,001,100.5,2021-03-03
3,001,101,2021-03-06
4,001,103,2021-03-20
5,001,102,2021-03-28", header = TRUE,sep = ",")
Expected result:
A_B <- read.table(text = "
ord, id, Score,DATE, VISIT, Weight
1,001,23,2021-03-01,DAY 0,100
2,001,26,2021-03-03,DAY 1,100.5
3,001,45,2021-03-10,DAY 2,101
4,001,41,2021-03-20,DAY 3,103
5,001,67,2021-03-26,DAY 4,102", header = TRUE,sep = ",")
You can use a nearest rolling join with data.table:
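library(data.table)
setDT(A)
setDT(B)
# Convert DATE from text to Date so the join can compare dates numerically
A[, DATE := as.Date(DATE)]
B[, DATE := as.Date(DATE)]
# For each row of A, take the B row with the same id and the nearest DATE.
# The i. prefix selects A's columns where both tables share a name (ord, DATE).
B[A, .(ord = i.ord, id, Score, DATE = i.DATE, VISIT, Weight), roll = "nearest", on = .(id, DATE)]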
#>    ord id Score       DATE VISIT Weight
#> 1:   1  1    23 2021-03-01 DAY 0  100.0
#> 2:   2  1    26 2021-03-03 DAY 1  100.5
#> 3:   3  1    45 2021-03-10 DAY 2  101.0
#> 4:   4  1    41 2021-03-20 DAY 3  103.0
#> 5:   5  1    67 2021-03-26 DAY 4  102.0
If you're not familiar with data.table, I found this tutorial useful for relating it to SQL logic.
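To make clear what a "nearest" match means here, the following is a minimal base R sketch of the same rule without data.table: for each row of A it looks at the B rows with the same id and takes the Weight whose DATE is closest. It is only an illustration, assuming the question's sample data (rebuilt inline so the block runs on its own). It is not how data.table implements rolling joins, and on ties which.min() simply keeps the first candidate.

```r
# Sample data from the question, rebuilt as plain data frames
A <- data.frame(
  ord = 1:5, id = 1L,
  Score = c(23, 26, 45, 41, 67),
  DATE = as.Date(c("2021-03-01", "2021-03-03", "2021-03-10",
                   "2021-03-20", "2021-03-26")),
  VISIT = paste("DAY", 0:4)
)
B <- data.frame(
  ord = 1:5, id = 1L,
  Weight = c(100, 100.5, 101, 103, 102),
  DATE = as.Date(c("2021-03-01", "2021-03-03", "2021-03-06",
                   "2021-03-20", "2021-03-28"))
)

# For every row i of A: restrict B to the same id, measure the gap in days
# to each candidate date, and keep the Weight with the smallest gap.
A$Weight <- vapply(seq_len(nrow(A)), function(i) {
  cand <- B[B$id == A$id[i], ]
  if (nrow(cand) == 0) return(NA_real_)   # no B record for this id
  gap <- abs(as.numeric(cand$DATE - A$DATE[i]))
  cand$Weight[which.min(gap)]
}, numeric(1))

A$Weight
# 100.0 100.5 101.0 103.0 102.0
```

This scans B once per row of A, so it is only meant for small inputs; the rolling join above is the better fit for larger tables.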
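Perform a left join on id, group by the row id of A, and compute the minimum difference between the dates (diff). As a side effect, the other B columns are taken from the row that attains that minimum: SQLite, sqldf's default backend, fills bare columns from the row holding the min() value.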
library(sqldf)
A$DATE <- as.Date(A$DATE)
B$DATE <- as.Date(B$DATE)
sqldf("select a.*, min(abs(a.DATE - b.DATE)) diff, b.Weight
from A as a
left join B as b using(id)
group by a.rowid")
giving:
  ord id Score       DATE VISIT diff Weight
1   1  1    23 2021-03-01 DAY 0    0  100.0
2   2  1    26 2021-03-03 DAY 1    0  100.5
3   3  1    45 2021-03-10 DAY 2    4  101.0
4   4  1    41 2021-03-20 DAY 3    0  103.0
5   5  1    67 2021-03-26 DAY 4    2  102.0
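The same join-then-minimise logic can also be spelled out step by step in base R, which makes the choice of the B row explicit instead of relying on how SQLite fills in b.Weight. This is a sketch, assuming the question's sample data (rebuilt inline) and that ord uniquely identifies each row of A; B_sub, cand and closest are names made up for the example.

```r
# Sample data from the question, rebuilt as plain data frames
A <- data.frame(
  ord = 1:5, id = 1L,
  Score = c(23, 26, 45, 41, 67),
  DATE = as.Date(c("2021-03-01", "2021-03-03", "2021-03-10",
                   "2021-03-20", "2021-03-26")),
  VISIT = paste("DAY", 0:4)
)
B <- data.frame(
  ord = 1:5, id = 1L,
  Weight = c(100, 100.5, 101, 103, 102),
  DATE = as.Date(c("2021-03-01", "2021-03-03", "2021-03-06",
                   "2021-03-20", "2021-03-28"))
)

# Keep only the B columns we need, renaming DATE to avoid a name clash
B_sub <- data.frame(id = B$id, DATE_B = B$DATE, Weight = B$Weight)

# Left join: pair every A row with every B row that has the same id
cand <- merge(A, B_sub, by = "id", all.x = TRUE)

# Absolute gap in days between the visit date and each candidate date
cand$diff <- abs(as.numeric(cand$DATE - cand$DATE_B))

# Sort so the smallest gap comes first within each A row, then keep
# only that first candidate per A row
cand <- cand[order(cand$ord, cand$diff), ]
closest <- cand[!duplicated(cand$ord),
                c("ord", "id", "Score", "DATE", "VISIT", "diff", "Weight")]

closest$diff    # 0 0 4 0 2
closest$Weight  # 100.0 100.5 101.0 103.0 102.0
```

Because the join builds every A-B pair per id before filtering, memory grows with the product of the group sizes, just as it does in the SQL version.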