Match and merge two data frames by string matching

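I have two data frames that I want to merge by district. The only problem is that the district names in one data frame may have extra spaces, commas, different upper-case/lower-case letters, or added words. So I want to match the two on the main district name. For example: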

df1
County of Herefordshire

df2
Herefordshire, County of

# Merged together, keeping the name from the first data frame

df12
County of Herefordshire

I thought charmatch would work, but it seems to work only for complete matches, so I end up losing a lot of matches.
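A minimal base R sketch of that behaviour. The vectors short_names and long_names are made up for illustration from names in this post. charmatch looks up each element of its first argument as a prefix of the entries in its second argument, so it is case-sensitive and finds nothing when the longer name is the one being looked up or when the words are reordered.

```r
# Illustrative vectors (hypothetical names, not the full data)
short_names <- c("Adur", "Allerdale", "Herefordshire, County of")
long_names  <- c("Adur District", "Allerdale District (B)", "County of Herefordshire")

# Short names are prefixes of the first two long names; the reordered
# Herefordshire entry is not a prefix of anything.
fwd <- charmatch(short_names, long_names)
fwd
# 1 2 NA

# Looking up the long names among the short ones finds nothing,
# because no long name is a prefix of a short name.
rev <- charmatch(long_names, short_names)
rev
# NA NA NA

# Matching is case-sensitive.
lower <- charmatch("adur", long_names)
lower
# NA
```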

Any suggestions?

Reproducible code:

# Data frame 1 (referred to as dat1 in the answer below)
dat1 <- structure(list(UK_Districts = c("Aberdeen City", "Aberdeenshire", 
"Abertawe - Swansea", "Adur District", "Allerdale District (B)", 
"Amber Valley District (B)", "Angus", "Argyll and Bute", "Arun District", 
"Ashfield District", "Ashford District (B)", "Babergh District", 
"Barking and Dagenham London Boro", "Barnet London Boro", "Barnsley District (B)", 
"Barrow-in-Furness District (B)", "Basildon District (B)", "Basingstoke and Deane District (B)", 
"Bassetlaw District", "Bath and North East Somerset", "Bedford (B)", 
"Bexley London Boro", "Birmingham District (B)", "Blaby District", 
"Blackburn with Darwen (B)", "Blackpool (B)", "Blaenau Gwent - Blaenau Gwent", 
"Bolsover District", "Bolton District (B)", "Boston District (B)", 
"Bournemouth, Christchurch and Poole", "Bracknell Forest (B)", 
"Bradford District (B)", "Braintree District", "Breckland District", 
"Brent London Boro", "Brentwood District (B)", "Bro Morgannwg - the Vale of Glamorgan", 
"Broadland District", "Bromley London Boro"), `2018` = c(8L, 
2L, 14L, 14L, 14L, 14L, 3L, 5L, 14L, 14L, 14L, 14L, 14L, 14L, 
14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 
14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L
)), row.names = c(NA, -40L), class = c("tbl_df", "tbl", "data.frame"
))

# Data frame 2 (referred to as dat2 in the answer below)
dat2 <- structure(list(UK_Districts = c("Adur", "Allerdale", "Amber Valley", 
"Arun", "Ashfield", "Ashford", "Babergh", "Barking and Dagenham", 
"Barnet", "Barnsley", "Barrow-in-Furness", "Basildon", "Basingstoke and Deane", 
"Bassetlaw", "Bath and North East Somerset", "Bedford", "Bexley", 
"Birmingham", "Blaby", "Blackburn with Darwen", "Blackpool", 
"Blaenau Gwent", "Bolsover", "Bolton", "Boston", "Bournemouth, Christchurch and Poole", 
"Bracknell Forest", "Bradford", "Braintree", "Breckland", "Brent", 
"Brentwood", "Bridgend", "Brighton and Hove", "Bristol, City of", 
"Broadland", "Bromley", "Bromsgrove", "Broxbourne", "Broxtowe"
), population_2018 = c(63869, 97527, 126678, 159827, 127151, 
129281, 91401, 211998, 392140, 245199, 67137, 185862, 175729, 
116839, 192106, 171623, 247258, 1141374, 100421, 148942, 139305, 
69713, 79530, 285372, 69366, 395784, 121676, 537173, 151561, 
139329, 330795, 76550, 144876, 290395, 463405, 129464, 331096, 
98662, 96876, 113272)), row.names = c(NA, 40L), class = "data.frame")

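It is hard to join them cleanly. You could try the fuzzyjoin package. In the code below, I join the two data frames based on the string distance between the two UK_Districts columns. Several string distance algorithms are available through the method argument of stringdist_full_join() and its variants. Here I used the Jaro–Winkler distance (method = "jw"). Judging by eye, a threshold of 0.25 seems to give reasonable matches.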

library(tidyverse)
library(fuzzyjoin)

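# Pair up the district names and record their "jw" string distance
distance_join_df <- stringdist_full_join(
  dat1 %>% select(UK_Districts),
  dat2 %>% select(UK_Districts),
  by = "UK_Districts",
  method = "jw", distance_col = "dist"
) %>% 
  # For each dat1 name, keep only the closest dat2 name
  arrange(UK_Districts.x, dist) %>% 
  group_by(UK_Districts.x) %>% 
  slice(1) %>% 
  ungroup() %>% 
  # Treat anything at or above the 0.25 threshold as "no match"
  mutate(UK_Districts.y = if_else(dist < 0.25, UK_Districts.y, NA_character_)) %>% 
  # Bring back the remaining columns from both data frames
  left_join(dat1, by = c("UK_Districts.x" = "UK_Districts")) %>% 
  left_join(dat2, by = c("UK_Districts.y" = "UK_Districts"))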

distance_join_df

# # A tibble: 40 x 5
#    UK_Districts.x            UK_Districts.y  dist `2018` population_2018
#    <chr>                     <chr>          <dbl>  <int>           <dbl>
#  1 Aberdeen City             NA             0.340      8              NA
#  2 Aberdeenshire             NA             0.340      2              NA
#  3 Abertawe - Swansea        NA             0.389     14              NA
#  4 Adur District             Adur           0.231     14           63869
#  5 Allerdale District (B)    Allerdale      0.197     14           97527
#  6 Amber Valley District (B) Amber Valley   0.173     14          126678
# <Omitted>
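
To get a feel for the dist column before settling on a threshold, a single pair can be checked directly with the stringdist package, which fuzzyjoin uses for these distances. This is a small sketch, not part of the join above. It assumes stringdist is installed, and p = 0.1 is only an illustrative value. In stringdist, method = "jw" with the default p = 0 returns the plain Jaro distance, and a non-zero p adds Winkler's prefix weighting.

```r
# Assumes the stringdist package is installed (fuzzyjoin depends on it).
library(stringdist)

a <- "Adur District"
b <- "Adur"

# Default p = 0: plain Jaro distance. This is the pair in row 4 of the
# output above, where dist is shown as 0.231.
d_jaro <- stringdist(a, b, method = "jw")

# Non-zero p (illustrative value): names sharing a prefix score as closer.
d_jw <- stringdist(a, b, method = "jw", p = 0.1)

round(d_jaro, 3)
d_jw < d_jaro
```

If a non-zero p were used in the join, the 0.25 cut-off would probably need to be re-checked by eye, since distances for names that share a prefix become smaller.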