具有日期差异的 sqldf 查询

sqldf query with date difference

我正在尝试对以下两个数据库(摘录)的颜色和日期差异执行(简化!)查询:

A                           B   
    A.COL   A.TIME              B.COL   B.TIME
1   blue    2009-01-31      1   blue    2007-01-31
2   blue    2009-02-28      2   blue    2008-12-31
3   blue    2009-03-31      3   blue    2009-02-28
4   blue    2009-04-30      4   blue    2009-04-30
5   blue    2009-05-31      5   blue    2009-06-30
6   blue    2009-06-30      6   blue    2016-08-31
7   blue    2016-03-31
8   blue    2016-04-30
9   red ...
10  red ...

我想做的是:根据COL和TIME的差异合并表,即两次之间的差异不能大于或小于2个月(或者换句话说介于-2之间和 +2,取决于从哪个日期开始)。

# For example starting with observation 1 from A, that would imply 2 matches:
2009-01-31 matched to 2008-12-31 (diff = 1)
2009-01-31 matched to 2009-02-28  (diff = -1)

# for obs 2 from A, that would imply 
2009-02-28 matched to 2008-12-31 (diff = 2)
2009-02-28 matched to 2009-02-28 (diff = 0)
2009-02-28 matched to 2009-04-30 (diff = -2)

等 我在考虑某种日期差异函数,要么来自 lubridate,这在少于 30 天的月份的情况下是有问题的,有时会产生 NA,或者 as.yearmon 来自 zoo,至少可以正确计算差异。但是,我无法将其正确实现到 sqldf(错误:语句错误:"as" 附近:语法错误)。原因似乎是不能将每个 R 函数与 sqldf 一起使用。 任何想法如何在 R 中完成?我也在寻找一种优雅的方法来相互减去月份。 lubridate 存在这个问题: Add/subtract 6 months (bond time) in R using lubridate, but here was one proposed way how to accomplish it with zoo: Get the difference between dates in terms of weeks, months, quarters, and years

获取数据(感谢下方的@bouncyball 提供代码):

A <- read.table(
  text = "
  A.COL   A.TIME          
  blue    2009-01-31     
  blue    2009-02-28      
  blue    2009-03-31      
  blue    2009-04-30      
  blue    2009-05-31      
  blue    2009-06-30
  blue    2016-03-31
  blue    2016-04-30
  ", header = T, stringsAsFactors = FALSE)


B <- read.table(
  text = "
  B.COL   B.TIME
  blue    2007-01-31
  blue    2008-12-31
  blue    2009-02-28
  blue    2009-04-30
  blue    2009-06-30
  blue    2016-08-31
  ", stringsAsFactors = FALSE, header = T)

这是一个使用 this SO postplyr 包中的函数的解决方案:

library(plyr)

# turn a date into a 'monthnumber' relative to an origin
monnb <- function(d) { 
  lt <- as.POSIXlt(as.Date(d, origin="1900-01-01"))
  lt$year*12 + lt$mon 
  } 

# compute a month difference as a difference between two monnb's
mondf <- function(d1, d2) { monnb(d2) - monnb(d1) }

# iterate over rows of A looking for matches in B
adply(A, 1, function(x)
  B[x$A.COL == B$B.COL & 
      abs(mondf(as.Date(x$A.TIME), as.Date(B$B.TIME))) <= 2,]
)

#     A.COL    A.TIME  B.COL    B.TIME
# 1   blue 2009-01-31  blue 2008-12-31
# 2   blue 2009-01-31  blue 2009-02-28
# 3   blue 2009-02-28  blue 2008-12-31
# 4   blue 2009-02-28  blue 2009-02-28
# 5   blue 2009-02-28  blue 2009-04-30
#  ....

编辑:data.table 实施

library(data.table)
merge_AB <- data.table(merge(A,B, by.x = 'A.COL', by.y = 'B.COL'))

merge_AB[,DateDiff := abs(mondf(A.TIME, B.TIME))
       ][DateDiff <= 2]

 #     A.COL     A.TIME     B.TIME DateDiff
 # 1:  blue 2009-01-31 2008-12-31        1
 # 2:  blue 2009-01-31 2009-02-28        1
 # 3:  blue 2009-02-28 2008-12-31        2
 # 4:  blue 2009-02-28 2009-02-28        0
 # 5:  blue 2009-02-28 2009-04-30        2
 # ...

数据

A <- read.table(
text = "
A.COL   A.TIME          
blue    2009-01-31     
blue    2009-02-28      
blue    2009-03-31      
blue    2009-04-30      
blue    2009-05-31      
blue    2009-06-30
blue    2016-03-31
blue    2016-04-30
", header = T, stringsAsFactors = FALSE)


B <- read.table(
  text = "
B.COL   B.TIME
blue    2007-01-31
blue    2008-12-31
blue    2009-02-28
blue    2009-04-30
blue    2009-06-30
blue    2016-08-31
", stringsAsFactors = FALSE, header = T)