Can dplyr join two data frames using a regular expression in by=?

Is there a way to join two data frames with dplyr's join functions, but matching on a regular expression instead of a straight by=c('foo' = 'bar')?

Something like:

library(dplyr)  # provides %>% and left_join()

people <- data.frame(
  id  = 1:10,
  emp = c("Caterpillar", "FEMA", "Community Hospital", "Gessert Grp.", "AT&T",
          "IBM Corp.", NA, "Smartguy Community College", NA, NA)
)



employers <- data.frame(
  employerID     = c(1, 2, 3, 4, 5),
  employerName   = c("Caterpillar Foundation", "Eli Lilly and Company Foundation Inc.",
                     "Archer Daniels Midland Co", "IBM Corporation",
                     "State Farm Co.  Foundation Matching Gifts"),
  # One pattern per employer; "Eli *Lilly" matches "Eli Lilly" and "EliLilly"
  employerRegexp = c("Caterpillar", "Eli *Lilly", "Archer *Daniels|ADM", "IBM", "State *Farm")
)

# Hypothetical syntax: `~=` is not valid R; it stands for "emp matches employerRegexp"
peoplewRealEmployerNames <- people %>%
  left_join(employers, by = c('emp' ~= 'employerRegexp'))

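Obviously ~= won't actually work, but maybe there is something similar?

dplyr isn't a hard requirement, but it's the style the rest of my code is written in, so it's my preferred solution.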

The fuzzyjoin package does exactly this, with syntax close to dplyr's.

So you only need to change the last two lines of your code to:

library(fuzzyjoin)

peoplewRealEmployerNames <- people %>%
  regex_left_join(employers, by=c('emp' = 'employerRegexp'))
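If it helps to see what a regex left join does conceptually, here is a minimal base-R sketch on a trimmed copy of the data. The helper `regex_left_join_base` and its arguments `text_col` and `regex_col` are hypothetical names made up for illustration; this is not how fuzzyjoin is implemented, and it tests every pair of rows, so it is only reasonable for small tables.

```r
# Trimmed toy data mirroring the question
people <- data.frame(
  id  = 1:4,
  emp = c("Caterpillar", "FEMA", "IBM Corp.", NA),
  stringsAsFactors = FALSE
)
employers <- data.frame(
  employerID     = c(1, 4),
  employerName   = c("Caterpillar Foundation", "IBM Corporation"),
  employerRegexp = c("Caterpillar", "IBM"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of dplyr or fuzzyjoin): keep every row of `x`
# and attach each row of `y` whose pattern in `regex_col` matches `x[[text_col]]`.
regex_left_join_base <- function(x, y, text_col, regex_col) {
  # Every (row of x, row of y) combination
  pairs <- expand.grid(xi = seq_len(nrow(x)), yi = seq_len(nrow(y)))

  # grepl() takes a single pattern, so test the pairs one at a time;
  # a missing string never matches
  hit <- vapply(seq_len(nrow(pairs)), function(k) {
    txt <- x[[text_col]][pairs$xi[k]]
    pat <- y[[regex_col]][pairs$yi[k]]
    !is.na(txt) && grepl(pat, txt)
  }, logical(1))
  pairs <- pairs[hit, , drop = FALSE]

  matched <- cbind(x[pairs$xi, , drop = FALSE], y[pairs$yi, , drop = FALSE])

  # Rows of x without any match get NA in all columns coming from y
  miss <- setdiff(seq_len(nrow(x)), pairs$xi)
  unmatched <- cbind(x[miss, , drop = FALSE],
                     y[rep(NA_integer_, length(miss)), , drop = FALSE])

  # Restore the original row order of x
  out <- rbind(matched, unmatched)
  out <- out[order(c(pairs$xi, miss)), , drop = FALSE]
  rownames(out) <- NULL
  out
}

result <- regex_left_join_base(people, employers, "emp", "employerRegexp")
print(result)
```

In this sketch "Caterpillar" and "IBM Corp." pick up their employer rows, while "FEMA" and the NA entry are kept with NA in the employer columns. For real work, `regex_left_join()` is the better choice; it also has an `ignore_case` argument if the patterns should match regardless of case.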