Joining two datasets by (non-uniform) names

I need to join two datasets, and the only identifier they share is the company name. For example:

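library(dplyr)    # tibble(), mutate(), %>%
library(stringr)  # str_remove_all(), str_squish(), regex()

db1 <- tibble(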
  Company = c('Bombardier Inc.','Honeywell Development Corp','The Pepsi Bottling Group (Canada), Ulc (“Pbgc”)','PepsiCo Canada ULC'),
  var1 = 1:4
)

db2 <- tibble(
  Name = c('Bombardier Inc.','Honeywell Dev Corp','The Pepsi Bottling Group (Canada), ULC','PepsiCo Canada ULC (“Pcu”)'),
  var2 = 6:9
)

Obviously a plain dplyr::left_join() won't work here. I tried the following, which did not work either:

fuzzyjoin::regex_left_join(db1,db2,by=c('Company'='Name'))
# A tibble: 4 x 4
  Company                                          var1 Name             var2
  <chr>                                           <int> <chr>           <int>
1 Bombardier Inc.                                     1 Bombardier Inc.     6
2 Honeywell Development Corp                          2 NA                 NA
3 The Pepsi Bottling Group (Canada), Ulc (“Pbgc”)     3 NA                 NA
4 PepsiCo Canada ULC                                  4 NA                 NA
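
A likely reason only the first row matches: a regex join treats each Name as a regular expression to be found inside Company, so an abbreviated name, or one with unescaped parentheses, fails. The sketch below approximates that per-pair check with base R's grepl() instead of calling fuzzyjoin itself; the vectors are illustrative, and the third pair is a shortened, made-up string used only to show the parenthesis effect.

```r
# Approximate per-pair check: does `pat` (the db2 value), used as a regex,
# match somewhere inside `txt` (the db1 value)?
company <- c("Bombardier Inc.", "Honeywell Development Corp", "Group (Canada), ULC")
pattern <- c("Bombardier Inc.", "Honeywell Dev Corp",         "Group (Canada), ULC")

matches <- mapply(function(txt, pat) grepl(pat, txt),
                  company, pattern, USE.NAMES = FALSE)
matches
# TRUE FALSE FALSE

# The third pair is literally identical, but "(Canada)" is read as a regex
# group, so the pattern looks for "Group Canada, ULC". A literal comparison
# succeeds:
literal <- grepl(pattern[3], company[3], fixed = TRUE)
literal
# TRUE
```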

I made some progress by removing "non-essential" words and characters from the names:

# Note the doubled backslashes: in an R string, '\b' is a backspace
# character, whereas '\\b' reaches the regex engine as a word boundary.
db1 <- db1 %>% mutate(Company.alt = str_remove_all(Company, regex(
  'The|Canada|Inc|Ltd|Company|\\bCo\\b|Corporation|Corp|Group|ULC|[:punct:]',
  ignore_case = TRUE
)) %>% str_squish())

db2 <- db2 %>% mutate(Name.alt = str_remove_all(Name, regex(
  'The|Canada|Inc|Ltd|Company|\\bCo\\b|Corporation|Corp|Group|ULC|[:punct:]',
  ignore_case = TRUE
)) %>% str_squish())

fuzzyjoin::regex_left_join(db1,db2,by=c('Company.alt'='Name.alt'))
# A tibble: 4 x 6
  Company                                          var1 Company.alt           Name            var2 Name.alt 
  <chr>                                           <int> <chr>                 <chr>          <int> <chr>    
1 Bombardier Inc.                                     1 Bombardier            Bombardier In~     6 Bombardi~
2 Honeywell Development Corp                          2 Honeywell Development Honeywell Dev~     7 Honeywel~
3 The Pepsi Bottling Group (Canada), Ulc (“Pbgc”)     3 Pepsi Bottling Pbgc   The Pepsi Bot~     8 Pepsi Bo~
4 PepsiCo Canada ULC                                  4 PepsiCo               NA                NA NA      

But this still leaves the last row unmatched. To be clear, the last row of Company.alt is PepsiCo, which is not treated as a fuzzy match for the last row of Name.alt, PepsiCo Pcu.
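
The asymmetry in that last row can be reproduced in base R. This is only a sketch: contains_either() is a hypothetical helper, not part of fuzzyjoin, and it uses literal substring tests rather than regular expressions.

```r
company_alt <- "PepsiCo"      # last row of Company.alt
name_alt    <- "PepsiCo Pcu"  # last row of Name.alt

# Direction the regex join checks: is name_alt found inside company_alt?
forward  <- grepl(name_alt, company_alt, fixed = TRUE)
# Reverse direction: is company_alt found inside name_alt?
backward <- grepl(company_alt, name_alt, fixed = TRUE)
c(forward, backward)
# FALSE TRUE

# Hypothetical helper: accept a pair when either string contains the other.
contains_either <- function(a, b) {
  mapply(function(s, t) grepl(t, s, fixed = TRUE) || grepl(s, t, fixed = TRUE),
         a, b, USE.NAMES = FALSE)
}

either <- contains_either(c("PepsiCo", "Honeywell Development"),
                          c("PepsiCo Pcu", "Honeywell Dev"))
either
# TRUE TRUE
```

A vectorised predicate like this could presumably be passed as match_fun to fuzzyjoin::fuzzy_left_join(), though that is untested here.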

Is there a way to left-join these two datasets successfully?

Try this:

We can join db1 and db2 by fuzzy string matching on their name columns.

max_dist sets the maximum string distance at which two names are still joined.

See: ?stringdist_left_join

library(dplyr)
library(fuzzyjoin)

fuzzyjoin::stringdist_left_join(x=db1, y=db2, max_dist = .35, 
                                by=c('Company'='Name'), 
                                method = 'jaccard', 
                                distance_col = "dist")
  Company                                          var1 Name                                    var2  dist
  <chr>                                           <int> <chr>                                  <int> <dbl>
1 Bombardier Inc.                                     1 Bombardier Inc.                            6 0    
2 Honeywell Development Corp                          2 Honeywell Dev Corp                         7 0.133
3 The Pepsi Bottling Group (Canada), Ulc (“Pbgc”)     3 The Pepsi Bottling Group (Canada), ULC     8 0.172
4 PepsiCo Canada ULC                                  4 PepsiCo Canada ULC (“Pcu”)                 9 0.316
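
To see where the dist values come from, here is a minimal base R sketch of a Jaccard distance over the sets of single characters in each string. It assumes the join above compares 1-character q-grams (believed to be the stringdist default, q = 1); jaccard_chars() is an illustrative name, not a function from either package.

```r
# Jaccard distance on character sets: 1 - |A ∩ B| / |A ∪ B|
jaccard_chars <- function(a, b) {
  sa <- unique(strsplit(a, "")[[1]])  # distinct characters of a
  sb <- unique(strsplit(b, "")[[1]])  # distinct characters of b
  1 - length(intersect(sa, sb)) / length(union(sa, sb))
}

d_same      <- jaccard_chars("Bombardier Inc.", "Bombardier Inc.")
d_honeywell <- jaccard_chars("Honeywell Development Corp", "Honeywell Dev Corp")
round(c(d_same, d_honeywell), 3)
# 0.000 0.133
```

The second value agrees with the 0.133 shown for row 2. Because this measure ignores character order and repetition, unrelated names built from similar letters can also fall under max_dist, so on a larger dataset it is worth inspecting the dist column before trusting the matches.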

1) phonics The phonics package provides several approximate-matching methods, such as soundex. See the package documentation for the others.

library(dplyr)
library(phonics)

db1s <- mutate(db1, s = soundex(Company, clean = FALSE))
db2s <- mutate(db2, s = soundex(Name, clean = FALSE))
left_join(db1s, db2s)

giving:

Joining, by = "s"
# A tibble: 4 x 5
  Company                                          var1 s     Name          var2
  <chr>                                           <int> <chr> <chr>        <int>
1 Bombardier Inc.                                     1 B516  Bombardier ~     6
2 Honeywell Development Corp                          2 H543  Honeywell D~     7
3 The Pepsi Bottling Group (Canada), Ulc (“Pbgc”)     3 T112  The Pepsi B~     8
4 PepsiCo Canada ULC                                  4 P122  PepsiCo Can~     9
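
For intuition about what the s column encodes, below is a simplified sketch of the classic Soundex rules in base R. It is not the phonics implementation and may differ from it on edge cases; soundex_sketch() is a hypothetical name. Non-letters are dropped first, which is an assumption of this sketch.

```r
soundex_sketch <- function(x) {
  codes <- c(B = 1, F = 1, P = 1, V = 1,
             C = 2, G = 2, J = 2, K = 2, Q = 2, S = 2, X = 2, Z = 2,
             D = 3, T = 3, L = 4, M = 5, N = 5, R = 6)
  one <- function(s) {
    # Keep letters only, upper-cased, one element per character
    chars <- strsplit(toupper(gsub("[^A-Za-z]", "", s)), "")[[1]]
    digit <- unname(codes[chars])  # NA for vowels, H, W and Y
    out   <- character(0)
    prev  <- digit[1]              # code of the first letter (kept as a letter)
    for (i in seq_along(chars)[-1]) {
      if (chars[i] %in% c("H", "W")) next  # H and W do not separate equal codes
      d <- digit[i]
      # Emit a digit only when it differs from the previous letter's code
      if (!is.na(d) && !identical(d, prev)) out <- c(out, d)
      prev <- d                            # vowels reset prev to NA
    }
    # First letter + digits, zero-padded and cut to four characters
    substr(paste0(c(chars[1], out, "000"), collapse = ""), 1, 4)
  }
  vapply(x, one, character(1), USE.NAMES = FALSE)
}

keys <- soundex_sketch(c("Bombardier Inc.", "Honeywell Development Corp",
                         "Honeywell Dev Corp", "PepsiCo Canada ULC"))
keys
# "B516" "H543" "H543" "P122"

# A made-up name that starts the same way gets the same key:
collision <- soundex_sketch("Pepsico Beverages")
collision
# "P122"
```

Only the first letter and the next few consonants contribute to the key, so names that begin alike collide. On a larger list it is worth checking for duplicated keys before joining on them.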

2) SQLite SQLite has a built-in soundex function.

library(sqldf)

sqldf("select *
  from db1
  left join db2 on soundex(Company) = soundex(Name)")

giving:

                                          Company var1                                   Name var2
1                                 Bombardier Inc.    1                        Bombardier Inc.    6
2                      Honeywell Development Corp    2                     Honeywell Dev Corp    7
3 The Pepsi Bottling Group (Canada), Ulc (“Pbgc”)    3 The Pepsi Bottling Group (Canada), ULC    8
4                              PepsiCo Canada ULC    4             PepsiCo Canada ULC (“Pcu”)    9