Joining two datasets by (non-uniform) names
I need to join two datasets, and the only identifier they share is the company name. For example:
db1 <- tibble(
Company = c('Bombardier Inc.','Honeywell Development Corp','The Pepsi Bottling Group (Canada), Ulc (“Pbgc”)','PepsiCo Canada ULC'),
var1 = 1:4
)
db2 <- tibble(
Name = c('Bombardier Inc.','Honeywell Dev Corp','The Pepsi Bottling Group (Canada), ULC','PepsiCo Canada ULC (“Pcu”)'),
var2 = 6:9
)
Obviously, a plain dplyr::left_join() won't work. I tried the following, which did not work either:
fuzzyjoin::regex_left_join(db1,db2,by=c('Company'='Name'))
# A tibble: 4 x 4
Company var1 Name var2
<chr> <int> <chr> <int>
1 Bombardier Inc. 1 Bombardier Inc. 6
2 Honeywell Development Corp 2 NA NA
3 The Pepsi Bottling Group (Canada), Ulc (“Pbgc”) 3 NA NA
4 PepsiCo Canada ULC 4 NA NA
I made some progress by removing "non-essential" words and characters from the names:
# Word boundaries must be written as \\b inside an R string ('\b' is a backspace).
db1 <- db1 %>% mutate(Company.alt = str_remove_all(Company,regex(
'The|Canada|Inc|Ltd|Company|\\bCo\\b|Corporation|Corp|Group|ULC|[:punct:]',
ignore_case = TRUE
)) %>% str_squish())
db2 <- db2 %>% mutate(Name.alt = str_remove_all(Name,regex(
'The|Canada|Inc|Ltd|Company|\\bCo\\b|Corporation|Corp|Group|ULC|[:punct:]',
ignore_case = TRUE
)) %>% str_squish())
fuzzyjoin::regex_left_join(db1,db2,by=c('Company.alt'='Name.alt'))
# A tibble: 4 x 6
Company var1 Company.alt Name var2 Name.alt
<chr> <int> <chr> <chr> <int> <chr>
1 Bombardier Inc. 1 Bombardier Bombardier In~ 6 Bombardi~
2 Honeywell Development Corp 2 Honeywell Development Honeywell Dev~ 7 Honeywel~
3 The Pepsi Bottling Group (Canada), Ulc (“Pbgc”) 3 Pepsi Bottling Pbgc The Pepsi Bot~ 8 Pepsi Bo~
4 PepsiCo Canada ULC 4 PepsiCo NA NA NA
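But this still leaves the last row unmatched. To be clear, the last row of Company.alt is PepsiCo, which is not treated as a fuzzy match for the last row of Name.alt, PepsiCo Pcu.
Is there a way to left join the two datasets successfully?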
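A likely reason for the miss, sketched in base R: as far as I know, fuzzyjoin's regex joins treat the right-hand column as the pattern and look for it inside the left-hand column, so the test is one-directional. The snippet below imitates that check with grepl(); the vectors company_alt and name_alt are just illustrative copies of two cleaned names from the output above, and the direction of the test is an assumption about the package rather than something shown in the question.

```r
# Two cleaned names from the example (left = Company.alt, right = Name.alt).
company_alt <- c("Honeywell Development", "PepsiCo")
name_alt    <- c("Honeywell Dev", "PepsiCo Pcu")

# Assumption: a pair is kept when the right-hand value, used as a regex,
# is found somewhere inside the left-hand value.
matches <- mapply(function(text, pattern) grepl(pattern, text),
                  company_alt, name_alt)
print(unname(matches))

# Swapping the roles shows the asymmetry for the PepsiCo pair.
print(grepl("PepsiCo", "PepsiCo Pcu"))
```

Under that assumption, "Honeywell Dev" is found inside "Honeywell Development", but "PepsiCo Pcu" is not found inside "PepsiCo", which is consistent with the NA in the last row.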
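Try this:
We can join db1 and db2 using fuzzy string matching on their name columns.
The max_dist argument sets the maximum distance at which two names are still joined.
See: ?stringdist_left_join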
library(dplyr)
library(fuzzyjoin)
fuzzyjoin::stringdist_left_join(x=db1, y=db2, max_dist = .35,
by=c('Company'='Name'),
method = 'jaccard',
distance_col = "dist")
Company var1 Name var2 dist
<chr> <int> <chr> <int> <dbl>
1 Bombardier Inc. 1 Bombardier Inc. 6 0
2 Honeywell Development Corp 2 Honeywell Dev Corp 7 0.133
3 The Pepsi Bottling Group (Canada), Ulc (“Pbgc”) 3 The Pepsi Bottling Group (Canada), ULC 8 0.172
4 PepsiCo Canada ULC 4 PepsiCo Canada ULC (“Pcu”) 9 0.316
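To make the dist column less of a black box, here is a minimal base R sketch of the Jaccard distance, assuming the q-gram size is left at 1 (which I believe is the stringdist default), so each name is reduced to its set of distinct characters. The helper name jaccard_chars is made up for this illustration and is not part of fuzzyjoin or stringdist.

```r
# Jaccard distance on sets of single characters (assumes q = 1).
jaccard_chars <- function(a, b) {
  set_a <- unique(strsplit(a, "")[[1]])
  set_b <- unique(strsplit(b, "")[[1]])
  # 1 minus (shared characters / all distinct characters)
  1 - length(intersect(set_a, set_b)) / length(union(set_a, set_b))
}

d <- jaccard_chars("Honeywell Development Corp", "Honeywell Dev Corp")
print(round(d, 3))  # 0.133, the value shown for row 2
```

Because this measure ignores character order and repetition, quite different names can still land within max_dist on larger datasets, so it is worth inspecting dist before trusting the matches.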
1) phonics: The phonics package offers many approximate-matching methods, such as soundex. See the package documentation for the others.
library(dplyr)
library(phonics)
db1s <- mutate(db1, s = soundex(Company, clean = FALSE))
db2s <- mutate(db2, s = soundex(Name, clean = FALSE))
left_join(db1s, db2s)
giving:
Joining, by = "s"
# A tibble: 4 x 5
Company var1 s Name var2
<chr> <int> <chr> <chr> <int>
1 Bombardier Inc. 1 B516 Bombardier ~ 6
2 Honeywell Development Corp 2 H543 Honeywell D~ 7
3 The Pepsi Bottling Group (Canada), Ulc (“Pbgc”) 3 T112 The Pepsi B~ 8
4 PepsiCo Canada ULC 4 P122 PepsiCo Can~ 9
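The codes in column s only capture the first letter plus the next three consonant sounds, which is why these names line up. Below is a simplified base R sketch of classic American Soundex to show where a code such as H543 comes from; soundex_sketch is a hypothetical helper written for this illustration, it strips non-letters first, and it is not guaranteed to agree with phonics::soundex() on every input.

```r
soundex_sketch <- function(x) {
  chars <- strsplit(toupper(gsub("[^A-Za-z]", "", x)), "")[[1]]
  if (length(chars) == 0) return(NA_character_)
  codes <- c(B = "1", F = "1", P = "1", V = "1",
             C = "2", G = "2", J = "2", K = "2", Q = "2",
             S = "2", X = "2", Z = "2",
             D = "3", T = "3", L = "4", M = "5", N = "5", R = "6")
  digits <- unname(codes[chars])
  digits[is.na(digits)] <- "0"            # vowels, H, W, Y carry no code
  # H and W (except as first letter) do not separate equal codes
  keep <- !(chars %in% c("H", "W")) | seq_along(chars) == 1
  runs <- rle(digits[keep])$values        # collapse adjacent repeats
  rest <- runs[-1]                        # first run belongs to the first letter
  rest <- rest[rest != "0"]               # vowels only act as separators
  substr(paste0(chars[1], paste(rest, collapse = ""), "000"), 1, 4)
}

print(soundex_sketch("Honeywell Development Corp"))  # H543
print(soundex_sketch("Honeywell Dev Corp"))          # H543
# Hypothetical name, not in the data: it gets the same code
print(soundex_sketch("Honeywell Design Ltd"))        # H543
```

The last line hints at the trade-off: any names that share their first few sounds collapse to one code, so on real data this join can produce false matches that need checking.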
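2) SQLite: SQLite has a built-in soundex function (it is only present when SQLite is compiled with the SQLITE_SOUNDEX option, which appears to be the case for the build sqldf uses here).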
library(sqldf)
sqldf("select *
from db1
left join db2 on soundex(Company) = soundex(Name)")
giving:
Company var1 Name var2
1 Bombardier Inc. 1 Bombardier Inc. 6
2 Honeywell Development Corp 2 Honeywell Dev Corp 7
3 The Pepsi Bottling Group (Canada), Ulc (“Pbgc”) 3 The Pepsi Bottling Group (Canada), ULC 8
4 PepsiCo Canada ULC 4 PepsiCo Canada ULC (“Pcu”) 9