Can dplyr join two data frames using a regular expression in by=?
Is there a way to join two data frames using dplyr's join operators, but matching on a regular expression instead of a plain by=c('foo' = 'bar')?
Something like:
library(dplyr)

people <- data.frame(
  id = 1:10,
  emp = c("Caterpillar", "FEMA", "Community Hospital", "Gessert Grp.", "AT&T", "IBM Corp.", NA, "Smartguy Community College", NA, NA)
)
employers <- data.frame(
  employerID = c(1, 2, 3, 4, 5),
  employerName = c("Caterpillar Foundation", "Eli Lilly and Company Foundation Inc.", "Archer Daniels Midland Co", "IBM Corporation", "State Farm Co. Foundation Matching Gifts"),
  employerRegexp = c("Caterpillar", "El *Lilly", "Archer *Daniels|ADM", "IBM", "State *Farm")
)
# Pseudo-code: `~=` is not a real operator; it stands in for the regex match I want.
peoplewRealEmployerNames <- people %>%
  left_join(employers, by = c('emp' ~= 'employerRegexp'))
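Obviously, ~= won't actually work, but maybe there is something similar?
dplyr isn't a hard requirement, but it's the style the rest of my code is written in, so it's my preferred solution.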
The fuzzyjoin package does exactly this, using roughly the same syntax as dplyr.
So you only need to change the last two lines of your code to:
library(fuzzyjoin)
peoplewRealEmployerNames <- people %>%
regex_left_join(employers, by=c('emp' = 'employerRegexp'))
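To make it clearer what a regex left join does with this data, here is a minimal base-R sketch of the matching logic. It is not fuzzyjoin's implementation: the helper name regex_left_join_sketch is hypothetical, and treating an NA in emp as "no match" is an assumption of this sketch.

```r
# Illustrative base-R sketch of a regex left join (not fuzzyjoin's code).
people <- data.frame(
  id = 1:10,
  emp = c("Caterpillar", "FEMA", "Community Hospital", "Gessert Grp.", "AT&T",
          "IBM Corp.", NA, "Smartguy Community College", NA, NA),
  stringsAsFactors = FALSE
)
employers <- data.frame(
  employerID = c(1, 2, 3, 4, 5),
  employerName = c("Caterpillar Foundation", "Eli Lilly and Company Foundation Inc.",
                   "Archer Daniels Midland Co", "IBM Corporation",
                   "State Farm Co. Foundation Matching Gifts"),
  employerRegexp = c("Caterpillar", "El *Lilly", "Archer *Daniels|ADM", "IBM", "State *Farm"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: x_col holds plain strings, pattern_col holds regexes.
regex_left_join_sketch <- function(x, y, x_col, pattern_col) {
  x_idx <- integer(0)
  y_idx <- integer(0)
  for (i in seq_len(nrow(x))) {
    value <- x[[x_col]][i]
    hits <- integer(0)
    if (!is.na(value)) {
      # Test every pattern in y against this one value of x.
      matched <- vapply(y[[pattern_col]], function(p) grepl(p, value),
                        logical(1), USE.NAMES = FALSE)
      hits <- which(matched)
    }
    # Left join: keep unmatched rows of x, paired with an all-NA row of y.
    if (length(hits) == 0) hits <- NA_integer_
    x_idx <- c(x_idx, rep(i, length(hits)))
    y_idx <- c(y_idx, hits)
  }
  out <- cbind(x[x_idx, , drop = FALSE], y[y_idx, , drop = FALSE])
  rownames(out) <- NULL
  out
}

result <- regex_left_join_sketch(people, employers, "emp", "employerRegexp")
print(result[, c("id", "emp", "employerID")])
```

Two things worth noting: a row in people is repeated once for every pattern it matches, so the result can have more rows than people, and the patterns are case-sensitive as written.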