Limiting the range of a merge with roll = "nearest"
I have two datasets that I want to merge. From this link, I know that I can merge these data.tables on the nearest year when there is no exact match, as follows:
library(data.table)
dfA <- fread("
A B C D E F G Z iso year matchcode
1 0 1 1 1 0 1 0 NLD 2010 NLD2010
2 1 0 0 0 1 0 1 NLD 2014 NLD2014
3 0 0 0 1 1 0 0 AUS 2010 AUS2010
4 1 0 1 0 0 1 0 AUS 2006 AUS2006
5 0 1 0 1 0 1 1 USA 2008 USA2008
6 0 0 1 0 0 0 1 USA 2010 USA2010
7 0 1 0 1 0 0 0 USA 2012 USA2012
8 1 0 1 0 0 1 0 BLG 2008 BLG2008
9 0 1 0 1 1 0 1 BEL 2008 BEL2008
10 1 0 1 0 0 1 0 BEL 2010 BEL2010
11 0 1 1 1 0 1 0 NLD 2010 NLD2010
12 1 0 0 0 1 0 1 NLD 2014 NLD2014
13 0 0 0 1 1 0 0 AUS 2010 AUS2010
14 1 0 1 0 0 1 0 AUS 2006 AUS2006
15 0 1 0 1 0 1 1 USA 2008 USA2008
16 0 0 1 0 0 0 1 USA 2010 USA2010
17 0 1 0 1 0 0 0 USA 2012 USA2012
18 1 0 1 0 0 1 0 BLG 2008 BLG2008
19 0 1 0 1 1 0 1 BEL 2008 BEL2008
20 1 0 1 0 0 1 0 BEL 2010 BEL2010",
header = TRUE)
dfB <- fread("
A B C D H I J K iso year matchcode
1 0 1 1 1 0 1 0 NLD 2009 NLD2009
2 1 0 0 0 1 0 1 NLD 2014 NLD2018
3 0 0 0 1 1 0 0 AUS 2011 AUS2011
4 1 0 1 0 0 1 0 AUS 2007 AUS2007
5 0 1 0 1 0 1 1 USA 2007 USA2007
6 0 0 1 0 0 0 1 USA 2010 USA2010
7 0 1 0 1 0 0 0 USA 2013 USA2013
8 1 0 1 0 0 1 0 BLG 2007 BLG2007
9 0 1 0 1 1 0 1 BEL 2009 BEL2009
10 1 0 1 0 0 1 0 BEL 2012 BEL2012",
header = TRUE)
#change the name of the matchcode-column
setnames(dfA, c("matchcode", "iso", "year"), c("matchcodeA", "isoA", "yearA"))
setnames(dfB, c("matchcode", "iso", "year"), c("matchcodeB", "isoB", "yearB"))
#store the column order so it can be restored at the end
namesA <- as.character( names( dfA ) )
namesB <- as.character( setdiff( names(dfB), names(dfA) ) )
colorder <- c(namesA, namesB)
#create columns to join on
dfA[, `:=`(iso.join = isoA, year.join = yearA)]
dfB[, `:=`(iso.join = isoB, year.join = yearB)]
#perform left join
result <- dfB[dfA, on = c("iso.join", "year.join"),roll = "nearest" ]
#drop columns that are not needed
result[, grep("^i\\.", names(result)) := NULL ]
result[, grep("join$", names(result)) := NULL ]
#set column order
setcolorder(result, colorder)
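I have two questions about this.
1) EDIT: this question was the result of a typo.
2) NLD 2014 in dfA is matched with NLD 2018 in dfB. If I consider a four-year difference too large and want to allow at most two years, what should I do?
If I want to limit the number of years allowed between dfA and dfB, how can I do that?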
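You have two options:
- Use roll = 2 or roll = -2. This requires the match to lie within 2 years, but only in one direction (roll = 2 looks back to earlier dfB years, roll = -2 looks ahead to later ones).
- Add two more columns to dfA to make it an explicit non-equi join.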
#perform left join
result <- dfB[dfA, on = c("iso.join", "year.join"), roll = 2 ]
# or
result <- dfB[dfA, on = c("iso.join", "year.join"), roll = -2 ]
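To make the direction of the roll concrete, here is a minimal sketch on toy data. The tables X and Y and the columns val and yearX are hypothetical and only serve the illustration. The last two lines show a possible alternative that is not part of the options above: join with roll = "nearest" and then blank out matches that are more than 2 years away, which should cover both directions at once.

```r
library(data.table)

# Hypothetical toy data; X plays the role of dfB, Y the role of dfA
X <- data.table(iso = "NLD", year = c(2009L, 2017L), val = c("x2009", "x2017"))
X[, yearX := year]   # keep X's own year; the join column will show Y's year
Y <- data.table(iso = "NLD", year = c(2010L, 2014L, 2016L))

# roll = 2: carry the previous X row forward by at most 2 years
fwd <- X[Y, on = .(iso, year), roll = 2]

# roll = -2: carry the next X row backward by at most 2 years
bwd <- X[Y, on = .(iso, year), roll = -2]

# Assumed alternative: take the nearest row in either direction,
# then blank out matches that are more than 2 years apart
near <- X[Y, on = .(iso, year), roll = "nearest"]
near[abs(year - yearX) > 2L, c("val", "yearX") := NA]
```

Here fwd only finds a match for 2010 (from 2009), bwd only finds one for 2016 (from 2017), and near finds both while leaving 2014 unmatched, because its closest X year is 3 years away.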
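The non-equi join requires some extra work, because it does not accept the roll = 'nearest' argument, so you need to use mult = 'first' or filter the result in a follow-up step. Note that mult = 'first' returns the first matching row of dfB, which is not necessarily the closest one.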
dfA[, `:=`(min_year.join = yearA - 2,
max_year.join = yearA + 2)]
result <- dfB[dfA,
on = .(iso.join,
year.join <= max_year.join,
year.join >= min_year.join)
#, mult = 'first'
]
#drop columns that are not needed
result[, grep("^i\\.", names(result)) := NULL ]
result[, grep("join", names(result)) := NULL ] #removed $
#set column order
setcolorder(result, colorder)
result
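The follow-up filtering could look roughly like the sketch below. It assumes a row identifier on the left table (idA, hypothetical) so that candidates can be grouped per original row; the toy tables A and B and the column valB are likewise made up for the illustration. The idea is to let the non-equi join return every candidate within the window and then keep the one with the smallest year gap.

```r
library(data.table)

# Hypothetical toy data; idA, valB and the year values are illustrative only
A <- data.table(idA = 1:3,
                iso = c("NLD", "NLD", "USA"),
                yearA = c(2010L, 2015L, 2012L))
B <- data.table(iso = c("NLD", "NLD", "NLD", "USA"),
                yearB = c(2009L, 2012L, 2018L, 2013L),
                valB = c("b1", "b2", "b3", "b4"))

# Window bounds on A, and a copy of B's year to join on
A[, `:=`(min_year = yearA - 2L, max_year = yearA + 2L)]
B[, year.join := yearB]

# Non-equi join: every B row whose year falls inside A's window
cand <- B[A, on = .(iso, year.join >= min_year, year.join <= max_year)]

# Follow-up filter: keep the closest candidate for each row of A
cand[, gap := abs(yearA - yearB)]
nearest <- cand[order(idA, gap), .SD[1L], by = idA]

# Drop the join helper columns
drop_cols <- grep("join", names(nearest), value = TRUE)
nearest[, (drop_cols) := NULL]
```

In this sketch NLD 2010 has two candidates (2009 and 2012) and keeps 2009, NLD 2015 has none within two years and stays NA, and USA 2012 keeps 2013. If two candidates are equally close, whichever comes first after ordering is kept, so add a tie-breaking column to order() if that matters.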