结合 data.table 中的合并执行 "fuzzyjoin"(和非模糊连接)
Doing a "fuzzyjoin" (and non-fuzzyjoin) in combination with a merge in data.table
我正在使用多个数据库。对于这些数据库中的每一个,我都创建了一个名为 matchcode
的密钥。 matchcode
是国家代码和年份的组合。大多数情况下,当我合并这些数据集时,我只是这样做:
dfA<- merge(dfA, dfB, by= "matchcode", all.x = TRUE, allow.cartesian=FALSE)
问题是有时年份不完全匹配:
dfA <- read.table(
text = "A B C D E F G iso year matchcode
1 0 1 1 1 0 1 0 NLD 2010 NLD2010
2 1 0 0 0 1 0 1 NLD 2014 NLD2014
3 0 0 0 1 1 0 0 AUS 2010 AUS2010
4 1 0 1 0 0 1 0 AUS 2006 AUS2006
5 0 1 0 1 0 1 1 USA 2008 USA2008
6 0 0 1 0 0 0 1 USA 2010 USA2010
7 0 1 0 1 0 0 0 USA 2012 USA2012
8 1 0 1 0 0 1 0 BLG 2008 BLG2008
9 0 1 0 1 1 0 1 BEL 2008 BEL2008
10 1 0 1 0 0 1 0 BEL 2010 BEL2010",
header = TRUE
)
dfB <- read.table(
text = "K L M N O P Q iso year matchcode
1 0 1 1 1 0 1 0 NLD 2009 NLD2009
2 1 0 0 0 1 0 1 NLD 2014 NLD2014
3 0 0 0 1 1 0 0 AUS 2011 AUS2011
4 1 0 1 0 0 1 0 AUS 2007 AUS2007
5 0 1 0 1 0 1 1 USA 2007 USA2007
6 0 0 1 0 0 0 1 USA 2011 USA2010
7 0 1 0 1 0 0 0 USA 2013 USA2013
8 1 0 1 0 0 1 0 BLG 2007 BLG2007
9 0 1 0 1 1 0 1 BEL 2009 BEL2009
10 1 0 1 0 0 1 0 BEL 2012 BEL2012",
header = TRUE
)
我正在寻找一个 data.table 解决方案,它的作用类似于以下 :
library(data.table)
setDT(dfA)
setDT(dfB)
dfA[dfB
, on = .(iso, year)
, roll = "nearest"
, .(ID, year.x = i.year, year.y = x.year, value, delta = abs(i.year- x.year))]
但是,除了这个解决方案之外,我还想:
- 在新 data.table.
中包含两个数据库的所有列
- 将滚动限制为
+1 | -1
。但是当我输入这个时,它似乎没有正确应用它。
对于 1. 我显然需要使用 mget,again thanks to Jaap:
dfA[dfB, on = .(iso, year), names(dfB)[1:10] :=
mget(paste0("i.", names(dfB)[1:10]))]
然而,我似乎无法成功地将它们组合在一起。我试过了:
dfA[dfB, on = .(iso, year), roll = "nearest", names(dfB)[1:10] :=
mget(paste0("i.", names(dfB)[1:10])),
.(matchcode, year.x = i.year, year.y = x.year, delta = abs(i.year - x.year))]
但这给出了:
Error in eval(bysub, xss, parent.frame()) : object 'i.year' not found.
以下是有效的:
dfA[dfB
, on = .(iso, year)
, roll = "nearest"
, .(matchcode, year.x = i.year, year.y = x.year, delta = abs(i.year - x.year))]
matchcode year.x year.y delta
1: NLD2010 2009 2010 1
2: NLD2014 2014 2014 0
3: AUS2010 2011 2010 1
4: AUS2006 2007 2006 1
5: USA2008 2007 2008 1
6: USA2010 2011 2010 1
7: USA2012 2013 2012 1
8: BLG2008 2007 2008 1
9: BEL2008 2009 2008 1
10: BEL2010 2012 2010 2
对如何进行有什么建议吗?
希望这对你有用:
dfA[, yearA := year]
res1 <- dfA[dfB, on = .(iso, year), roll = 1, nomatch = 0]
res2 <- dfA[dfB, on = .(iso, year), roll = -1, nomatch = 0]
res <- rbind(res1, res2[yearA > year])
setnames(res, c('year', 'matchcode', 'i.matchcode'), c('yearB', 'matchcodeA', 'matchcodeB'))
# A B C D E F G iso yearB matchcodeA yearA K L M N O P Q matchcodeB
# 1: 1 0 0 0 1 0 1 NLD 2014 NLD2014 2014 1 0 0 0 1 0 1 NLD2014
# 2: 0 0 0 1 1 0 0 AUS 2011 AUS2010 2010 0 0 0 1 1 0 0 AUS2011
# 3: 1 0 1 0 0 1 0 AUS 2007 AUS2006 2006 1 0 1 0 0 1 0 AUS2007
# 4: 0 0 1 0 0 0 1 USA 2011 USA2010 2010 0 0 1 0 0 0 1 USA2010
# 5: 0 1 0 1 0 0 0 USA 2013 USA2012 2012 0 1 0 1 0 0 0 USA2013
# 6: 0 1 0 1 1 0 1 BEL 2009 BEL2008 2008 0 1 0 1 1 0 1 BEL2009
# 7: 0 1 1 1 0 1 0 NLD 2009 NLD2010 2010 0 1 1 1 0 1 0 NLD2009
# 8: 0 1 0 1 0 1 1 USA 2007 USA2008 2008 0 1 0 1 0 1 1 USA2007
# 9: 0 1 0 1 0 0 0 USA 2011 USA2012 2012 0 0 1 0 0 0 1 USA2010
# 10: 1 0 1 0 0 1 0 BLG 2007 BLG2008 2008 1 0 1 0 0 1 0 BLG2007
# 11: 1 0 1 0 0 1 0 BEL 2009 BEL2010 2010 0 1 0 1 1 0 1 BEL2009
我正在使用多个数据库。对于这些数据库中的每一个,我都创建了一个名为 matchcode
的密钥。 matchcode
是国家代码和年份的组合。大多数情况下,当我合并这些数据集时,我只是这样做:
dfA<- merge(dfA, dfB, by= "matchcode", all.x = TRUE, allow.cartesian=FALSE)
问题是有时年份不完全匹配:
dfA <- read.table(
text = "A B C D E F G iso year matchcode
1 0 1 1 1 0 1 0 NLD 2010 NLD2010
2 1 0 0 0 1 0 1 NLD 2014 NLD2014
3 0 0 0 1 1 0 0 AUS 2010 AUS2010
4 1 0 1 0 0 1 0 AUS 2006 AUS2006
5 0 1 0 1 0 1 1 USA 2008 USA2008
6 0 0 1 0 0 0 1 USA 2010 USA2010
7 0 1 0 1 0 0 0 USA 2012 USA2012
8 1 0 1 0 0 1 0 BLG 2008 BLG2008
9 0 1 0 1 1 0 1 BEL 2008 BEL2008
10 1 0 1 0 0 1 0 BEL 2010 BEL2010",
header = TRUE
)
dfB <- read.table(
text = "K L M N O P Q iso year matchcode
1 0 1 1 1 0 1 0 NLD 2009 NLD2009
2 1 0 0 0 1 0 1 NLD 2014 NLD2014
3 0 0 0 1 1 0 0 AUS 2011 AUS2011
4 1 0 1 0 0 1 0 AUS 2007 AUS2007
5 0 1 0 1 0 1 1 USA 2007 USA2007
6 0 0 1 0 0 0 1 USA 2011 USA2010
7 0 1 0 1 0 0 0 USA 2013 USA2013
8 1 0 1 0 0 1 0 BLG 2007 BLG2007
9 0 1 0 1 1 0 1 BEL 2009 BEL2009
10 1 0 1 0 0 1 0 BEL 2012 BEL2012",
header = TRUE
)
我正在寻找一个 data.table 解决方案,它的作用类似于以下
library(data.table)
setDT(dfA)
setDT(dfB)
dfA[dfB
, on = .(iso, year)
, roll = "nearest"
, .(ID, year.x = i.year, year.y = x.year, value, delta = abs(i.year- x.year))]
但是,除了这个解决方案之外,我还想:
- 在新 data.table. 中包含两个数据库的所有列
- 将滚动限制为
+1 | -1
。但是当我输入这个时,它似乎没有正确应用它。
对于 1. 我显然需要使用 mget,again thanks to Jaap:
dfA[dfB, on = .(iso, year), names(dfB)[1:10] :=
mget(paste0("i.", names(dfB)[1:10]))]
然而,我似乎无法成功地将它们组合在一起。我试过了:
dfA[dfB, on = .(iso, year), roll = "nearest", names(dfB)[1:10] :=
mget(paste0("i.", names(dfB)[1:10])),
.(matchcode, year.x = i.year, year.y = x.year, delta = abs(i.year - x.year))]
但这给出了:
Error in eval(bysub, xss, parent.frame()) : object 'i.year' not found.
以下是有效的:
dfA[dfB
, on = .(iso, year)
, roll = "nearest"
, .(matchcode, year.x = i.year, year.y = x.year, delta = abs(i.year - x.year))]
matchcode year.x year.y delta
1: NLD2010 2009 2010 1
2: NLD2014 2014 2014 0
3: AUS2010 2011 2010 1
4: AUS2006 2007 2006 1
5: USA2008 2007 2008 1
6: USA2010 2011 2010 1
7: USA2012 2013 2012 1
8: BLG2008 2007 2008 1
9: BEL2008 2009 2008 1
10: BEL2010 2012 2010 2
对如何进行有什么建议吗?
希望这对你有用:
dfA[, yearA := year]
res1 <- dfA[dfB, on = .(iso, year), roll = 1, nomatch = 0]
res2 <- dfA[dfB, on = .(iso, year), roll = -1, nomatch = 0]
res <- rbind(res1, res2[yearA > year])
setnames(res, c('year', 'matchcode', 'i.matchcode'), c('yearB', 'matchcodeA', 'matchcodeB'))
# A B C D E F G iso yearB matchcodeA yearA K L M N O P Q matchcodeB
# 1: 1 0 0 0 1 0 1 NLD 2014 NLD2014 2014 1 0 0 0 1 0 1 NLD2014
# 2: 0 0 0 1 1 0 0 AUS 2011 AUS2010 2010 0 0 0 1 1 0 0 AUS2011
# 3: 1 0 1 0 0 1 0 AUS 2007 AUS2006 2006 1 0 1 0 0 1 0 AUS2007
# 4: 0 0 1 0 0 0 1 USA 2011 USA2010 2010 0 0 1 0 0 0 1 USA2010
# 5: 0 1 0 1 0 0 0 USA 2013 USA2012 2012 0 1 0 1 0 0 0 USA2013
# 6: 0 1 0 1 1 0 1 BEL 2009 BEL2008 2008 0 1 0 1 1 0 1 BEL2009
# 7: 0 1 1 1 0 1 0 NLD 2009 NLD2010 2010 0 1 1 1 0 1 0 NLD2009
# 8: 0 1 0 1 0 1 1 USA 2007 USA2008 2008 0 1 0 1 0 1 1 USA2007
# 9: 0 1 0 1 0 0 0 USA 2011 USA2012 2012 0 0 1 0 0 0 1 USA2010
# 10: 1 0 1 0 0 1 0 BLG 2007 BLG2008 2008 1 0 1 0 0 1 0 BLG2007
# 11: 1 0 1 0 0 1 0 BEL 2009 BEL2010 2010 0 1 0 1 1 0 1 BEL2009