Limiting the range of a merge with roll = "nearest"

I have two datasets that I want to merge. From this link, I know that I can merge these data.tables on the nearest year when there is no exact match, as follows:

  library(data.table)
  dfA <- fread("
  A   B   C   D   E   F   G   Z   iso   year   matchcode
  1   0   1   1   1   0   1   0   NLD   2010   NLD2010
  2   1   0   0   0   1   0   1   NLD   2014   NLD2014
  3   0   0   0   1   1   0   0   AUS   2010   AUS2010
  4   1   0   1   0   0   1   0   AUS   2006   AUS2006
  5   0   1   0   1   0   1   1   USA   2008   USA2008
  6   0   0   1   0   0   0   1   USA   2010   USA2010
  7   0   1   0   1   0   0   0   USA   2012   USA2012
  8   1   0   1   0   0   1   0   BLG   2008   BLG2008
  9   0   1   0   1   1   0   1   BEL   2008   BEL2008
  10  1   0   1   0   0   1   0   BEL   2010   BEL2010
  11  0   1   1   1   0   1   0   NLD   2010   NLD2010
  12  1   0   0   0   1   0   1   NLD   2014   NLD2014
  13  0   0   0   1   1   0   0   AUS   2010   AUS2010
  14  1   0   1   0   0   1   0   AUS   2006   AUS2006
  15  0   1   0   1   0   1   1   USA   2008   USA2008
  16  0   0   1   0   0   0   1   USA   2010   USA2010
  17  0   1   0   1   0   0   0   USA   2012   USA2012
  18  1   0   1   0   0   1   0   BLG   2008   BLG2008
  19  0   1   0   1   1   0   1   BEL   2008   BEL2008
  20  1   0   1   0   0   1   0   BEL   2010   BEL2010",
  header = TRUE)

  dfB <- fread("
  A   B   C   D   H   I   J   K   iso   year   matchcode
  1   0   1   1   1   0   1   0   NLD   2009   NLD2009
  2   1   0   0   0   1   0   1   NLD   2018   NLD2018
  3   0   0   0   1   1   0   0   AUS   2011   AUS2011
  4   1   0   1   0   0   1   0   AUS   2007   AUS2007
  5   0   1   0   1   0   1   1   USA   2007   USA2007
  6   0   0   1   0   0   0   1   USA   2010   USA2010
  7   0   1   0   1   0   0   0   USA   2013   USA2013
  8   1   0   1   0   0   1   0   BLG   2007   BLG2007
  9   0   1   0   1   1   0   1   BEL   2009   BEL2009
  10   1   0   1   0   0   1   0  BEL   2012   BEL2012",
  header = TRUE)

#rename the matchcode, iso and year columns so the two tables do not clash
setnames(dfA, c("matchcode", "iso", "year"), c("matchcodeA", "isoA", "yearA"))
setnames(dfB, c("matchcode", "iso", "year"), c("matchcodeB", "isoB", "yearB"))

#store the column order so it can be restored at the end
namesA <- as.character( names( dfA ) )
namesB <- as.character( setdiff( names(dfB), names(dfA) ) )
colorder <- c(namesA, namesB)

#create columns to join on
dfA[, `:=`(iso.join = isoA, year.join = yearA)]
dfB[, `:=`(iso.join = isoB, year.join = yearB)]

#perform left join
result <- dfB[dfA, on = c("iso.join", "year.join"),roll = "nearest" ]

#drop columns that are not needed
result[, grep("^i\\.", names(result)) := NULL ]
result[, grep("join$", names(result)) := NULL ]

#set column order
setcolorder(result, colorder)

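I have two questions about this.

1) Edit: this question was the result of a typo.

2) NLD 2014 in dfA is matched with NLD 2018 in dfB. If I think a difference of four years is too large and want to limit matches to within two years, what should I do?

How can I limit the number of years allowed between dfA and dfB?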

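You have two options:

  1. Use roll = 2 or roll = -2. This limits the match to within 2 years, but in one direction only.
  2. Add two more columns to dfA and make it an explicit non-equi join.

#option 1: perform left join, rolling at most 2 years in one direction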
result <- dfB[dfA, on = c("iso.join", "year.join"), roll = 2 ] 

# or
result <- dfB[dfA, on = c("iso.join", "year.join"), roll = -2 ] 
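
To make the direction of a numeric roll concrete, here is a minimal sketch on two tiny tables. The names x, i, valB, fwd and bwd are made up for illustration and are not part of the question's data. A positive roll carries the previous observation in x forward, a negative roll carries the next observation backward, and in both cases only up to the stated distance.

```r
library(data.table)

# Hypothetical lookup table (plays the role of dfB)
x <- data.table(iso  = "NLD",
                year = c(2009L, 2013L),
                valB = c("b2009", "b2013"))

# Hypothetical table to be matched (plays the role of dfA)
i <- data.table(iso  = "NLD",
                year = c(2010L, 2012L))

# roll = 2: use the most recent earlier row of x, at most 2 years back
fwd <- x[i, on = .(iso, year), roll = 2]
# 2010 gets b2009 (1 year earlier); 2012 gets NA (2009 is 3 years earlier)

# roll = -2: use the next later row of x, at most 2 years ahead
bwd <- x[i, on = .(iso, year), roll = -2]
# 2010 gets NA (2013 is 3 years later); 2012 gets b2013 (1 year later)
```

Neither call looks in both directions, which is why a numeric roll is not a drop-in replacement for roll = "nearest".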

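The non-equi join requires some extra work on your part, because it does not accept the roll = 'nearest' argument. You will need to use mult = 'first' or filter the result in a follow-up step.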

dfA[, `:=`(min_year.join = yearA - 2,
           max_year.join = yearA + 2)]

result <- dfB[dfA,
              on = .(iso.join,
                          year.join <= max_year.join,
                          year.join >= min_year.join)
              #, mult = 'first'
              ]

#drop columns that are not needed
result[, grep("^i\\.", names(result)) := NULL ]
result[, grep("join", names(result)) := NULL ] #no trailing $, so suffixed join columns such as year.join.1 are dropped too

#set column order
setcolorder(result, colorder)
result
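
Another way to do the filtering in a follow-up step is to keep roll = "nearest" and then blank out the matches that are too far apart. The following is only a sketch under assumptions: the tables x and i, the columns valA and valB, and the max_gap variable are invented for illustration, and the copies yearA and yearB mirror the renamed year columns used in the question.

```r
library(data.table)

# Hypothetical lookup table (plays the role of dfB)
x <- data.table(iso  = c("NLD", "NLD", "USA"),
                year = c(2009L, 2018L, 2010L),
                valB = c("b1", "b2", "b3"))

# Hypothetical table to be matched (plays the role of dfA)
i <- data.table(iso  = c("NLD", "NLD", "USA"),
                year = c(2010L, 2014L, 2011L),
                valA = c("a1", "a2", "a3"))

# Keep a copy of each year, because the join column in the result
# holds the values from i, not from x
x[, yearB := year]
i[, yearA := year]

# Nearest-year left join: every row of i gets its closest row of x
res <- x[i, on = .(iso, year), roll = "nearest"]

# Discard matches that are more than max_gap years apart
max_gap <- 2L
res[abs(yearA - yearB) > max_gap, c("valB", "yearB") := NA]
res
```

In this toy data NLD 2014 is first paired with NLD 2018 and then reset to NA because the gap is four years, while the other two rows keep their matches. With real data, every column that comes from the lookup table would need to be listed in the final assignment.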