Conditional left_join by id and nearest date
I have two datasets that should be left-joined on id, subject to a condition on how far apart their dates are.
A <- data.frame(id = c(1,2,3),
application_date = as.Date(c("2010-05-08", "2012-08-08", "2013-06-23")))
B <- data.frame(id = c(1,1,2,2,3,3),
date = as.Date(c("2009-01-02", "2009-12-24", "2011-11-11", "2012-05-20", "2013-03-21", "2013-06-05")),
value1 = c(2500, 3000, 1200, 1900, 5500, 4500),
value2 = c(2500, 3000, 1200, 1900, 5500, 4500),
value3 = c(2500, 3000, 1200, 1900, 5500, 4500))
For the same id, if the date difference is less than or equal to 6 months, join on the nearest date, as follows:
library(data.table)
setDT(B)[, application_date := date]
# join on id (neither A nor B has a `customer` column) and roll to the nearest date
B[A, on = .(id, date = application_date), roll = 'nearest']
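For the same id, if the date difference is greater than 6 months, all joined values (value1, value2, value3) should be filled with NA.
So how can I construct and combine a left_join with an ifelse condition?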
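One way the rolling-join attempt above might be extended is to keep the matched date from B, compute the gap after the join, and blank the joined columns where the gap is too large. This is only a sketch under assumptions: "6 months" is approximated as 182 days, the names `matched_date`, `gap` and `max_gap` are hypothetical, and A carries an extra row (id 3, 2015-06-23) so that one application has no B date within range.

```r
library(data.table)

A <- data.table(id = c(1, 2, 3, 3),
                application_date = as.Date(c("2010-05-08", "2012-08-08",
                                             "2013-06-23", "2015-06-23")))
B <- data.table(id = c(1, 1, 2, 2, 3, 3),
                date = as.Date(c("2009-01-02", "2009-12-24", "2011-11-11",
                                 "2012-05-20", "2013-03-21", "2013-06-05")),
                value1 = c(2500, 3000, 1200, 1900, 5500, 4500),
                value2 = c(2500, 3000, 1200, 1900, 5500, 4500),
                value3 = c(2500, 3000, 1200, 1900, 5500, 4500))

max_gap <- 182  # assumption: 6 months taken as 182 days

# Keep a copy of B's own date: after the join, `date` holds A's values
B[, matched_date := date]

# One row per row of A, matched to the nearest B date within the same id
res <- B[A, on = .(id, date = application_date), roll = "nearest"]
setnames(res, "date", "application_date")

# Distance in days between the application and the matched B row
res[, gap := abs(as.numeric(application_date - matched_date))]

# Blank the joined columns where the nearest date is still too far away
res[gap > max_gap, c("matched_date", "value1", "value2", "value3") := NA]

res
# value1 comes out as 3000, 1900, 4500, NA
```

The rolling join always finds a nearest row, however distant, which is why the threshold has to be applied as a separate step after the join.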
Here is a possible solution using the fuzzy_join() function from the fuzzyjoin package.
A <- data.frame(id = c(1,2,3, 3),
application_date = as.Date(c("2010-05-08", "2012-08-08", "2013-06-23", "2015-06-23")))
B <- data.frame(id = c(1,1,2,2,3,3),
date = as.Date(c("2009-01-02", "2009-12-24", "2011-11-11", "2012-05-20", "2013-03-21", "2013-06-05")),
value1 = c(2500, 3000, 1200, 1900, 5500, 4500),
value2 = c(2500, 3000, 1200, 1900, 5500, 4500),
value3 = c(2500, 3000, 1200, 1900, 5500, 4500))
library(fuzzyjoin)
library(dplyr)
# Define the match test; the same function is applied to each column pair in `by`:
# exact match on id, interval match on the dates (182 days approximates 6 months)
test <- function(id1, id2) {
  if (inherits(id1, "Date")) {
    result <- (id1 - 182) < id2 & id2 < (id1 + 182)
  } else {
    result <- (id1 == id2)
  }
  # return the logical vector explicitly
  result
}
answer<-fuzzy_join(A, B, by=c("id"="id", "application_date"="date"), mode='left', match_fun=test)
#create a grouping variable
answer$uid <- paste(answer$id.x, answer$application_date)
#find the min date of match
answer %>% group_by(uid) %>%
mutate(delta= abs(application_date - date)) %>%
arrange(delta, .by_group = TRUE) %>%
slice_head(n = 1)
Note that I have not tested the above against all possible corner cases, for example multiple rows in B matching a single row in A.
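For comparison, the same rule can be sketched with dplyr alone, closer to the "left_join plus ifelse" wording of the question. This is a hedged sketch, not a tested replacement for the answer above: it assumes dplyr 1.1.0 or later (for the `relationship` argument and for `if_else()` accepting a bare `NA`), treats 6 months as 182 days inclusive, and uses the hypothetical names `gap` and `max_gap`.

```r
library(dplyr)

A <- data.frame(id = c(1, 2, 3, 3),
                application_date = as.Date(c("2010-05-08", "2012-08-08",
                                             "2013-06-23", "2015-06-23")))
B <- data.frame(id = c(1, 1, 2, 2, 3, 3),
                date = as.Date(c("2009-01-02", "2009-12-24", "2011-11-11",
                                 "2012-05-20", "2013-03-21", "2013-06-05")),
                value1 = c(2500, 3000, 1200, 1900, 5500, 4500),
                value2 = c(2500, 3000, 1200, 1900, 5500, 4500),
                value3 = c(2500, 3000, 1200, 1900, 5500, 4500))

max_gap <- 182  # assumption: 6 months taken as 182 days

result <- A %>%
  # pair every application with every B row of the same id
  left_join(B, by = "id", relationship = "many-to-many") %>%
  mutate(gap = abs(as.numeric(application_date - date))) %>%
  # keep only the nearest B row for each application
  group_by(id, application_date) %>%
  slice_min(gap, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  # blank the joined columns when even the nearest row is too far away
  mutate(across(c(date, value1, value2, value3),
                ~ if_else(gap <= max_gap, .x, NA)))

result
# value1 comes out as 3000, 1900, 4500, NA
```

`if_else()` is used rather than base `ifelse()` because `ifelse()` strips the Date class from the `date` column. Ties are broken arbitrarily by `with_ties = FALSE`, so the corner case mentioned above (several B rows equally close to one A row) yields a single row here.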