Fuzzy matching two long character vectors in R

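I have two vectors: Candidates$names, containing the names of about 45,000 electoral candidates, and Incumbents$names, containing the names of about 7,600 members of parliament. I want to check, for each name in Candidates, whether it exists in Incumbents, and create a new dummy variable incumbent in Candidates that takes the value 1 if it does and 0 if it does not.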

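My problem is that the names can differ slightly between the two lists. Sometimes a name includes a title, sometimes a middle name, and so on. Exact matching therefore does not work reliably, and I need a method that allows for some fuzziness.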

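I tried combining expand.grid(Candidates$names, Incumbents$names) with adist() as a proximity measure and then setting an arbitrary percentage (edit distance divided by name length) as a cut-off point, but the resulting table is so long that R crashes on my computer, and the approach seems neither practical nor reliable enough.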

Is there a better way to do the fuzzy matching I need?

Edit: Here are some example vectors.

Candidates <- data.frame(name = c("Barack Obama", "George W. Bush", "Jimmy Carter", "Tony Blair", "Mickey Mouse", "Darth Vader"), incumbent = NA)
Incumbents <- data.frame(name = c("Anakin Skywalker", "Sir Tony Blair", "Barack Hussein Obama", "James Carter"))

The resulting data frame should look like this:

Candidates <- data.frame(name = c("Barack Obama", "George W. Bush", "Jimmy Carter", "Tony Blair", "Mickey Mouse", "Darth Vader"), incumbent = c(1, 0, 1, 1, 0, 0))

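Edit #2: phiver's response was very helpful, but I ran into the problem that some names occur more than once in my data set. To identify them uniquely, I would like to use an additional variable, Candidates$party and Incumbents$party, in the matching process. How can I include this additional, exactly matched variable in the code?

Modifying my example: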

Candidates <- data.frame(name = c("Barack Obama", "George W. Bush", "Jimmy Carter", "Tony Blair", "Mickey Mouse", "Darth Vader", "John Smith", "John Smith"), party = c("Democrat", "Republican", "Democrat", "Democrat", "Republican", "Democrat", "Democrat", "Republican") , incumbent = NA)
Incumbents <- data.frame(name = c("Anakin Skywalker", "Sir Tony Blair", "Barack Hussein Obama", "James Carter", "John Smith"), party = c("Republican", "Democrat", "Democrat", "Democrat", "Republican"))

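With the Republican John Smith added as an incumbent, the result should be the same as before for the existing rows, and only the Republican John Smith should be flagged: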

Candidates <- data.frame(name = c("Barack Obama", "George W. Bush", "Jimmy Carter", "Tony Blair", "Mickey Mouse", "Darth Vader", "John Smith", "John Smith"), party = c("Democrat", "Republican", "Democrat", "Democrat", "Republican", "Democrat", "Democrat", "Republican"), incumbent = c(1, 0, 1, 1, 0, 0, 0, 1))

With the fuzzyjoin package you can match the names as part of a join.

For the example given, this code reproduces the expected output.

library(fuzzyjoin)
library(dplyr)

Candidates %>% 
  stringdist_left_join(Incumbents, by = c("name" = "name"), method = "jw", max_dist = 0.2) %>% 
  rename(candidate_name = name.x) %>% 
  mutate(incumbent = if_else(!is.na(name.y), 1, 0)) %>% 
  select(-name.y)

  candidate_name incumbent
1   Barack Obama         1
2 George W. Bush         0
3   Jimmy Carter         1
4     Tony Blair         1
5   Mickey Mouse         0
6    Darth Vader         0

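The reason for using jw (Jaro-Winkler) is that it was developed for matching names that contain only a few errors. The closer the distance is to 0, the more similar the names, with 0 meaning the names are identical. Choosing the right max_dist takes some fine-tuning. Cleaning up the names beforehand may help.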
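To get a feel for a sensible max_dist, it can help to look at the raw distances for a few known pairs first. The following is only a sketch: it assumes the stringdist package is installed (fuzzyjoin builds on it), and the extra "Tony Blair" entry in the incumbents vector is added purely to illustrate the distance of an exact match.

```r
library(stringdist)

candidate  <- "Tony Blair"
# Example incumbents from the question, plus an exact copy for illustration
incumbents <- c("Anakin Skywalker", "Sir Tony Blair", "Barack Hussein Obama",
                "James Carter", "Tony Blair")

# Distance from the single candidate to every incumbent name (0 = identical)
jw_dist <- stringdist(candidate, incumbents, method = "jw")

# Inspect which names would fall inside a cut-off of 0.2
data.frame(incumbent     = incumbents,
           jw_dist       = round(jw_dist, 3),
           within_cutoff = jw_dist <= 0.2)
```

Running this over a sample of names that you have checked by hand gives a rough idea of where true matches end and false matches begin.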

To remove titles such as "Sir", you can apply a regular expression directly to the existing names with gsub, or use stringr::str_remove, before doing the join.
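A minimal base R sketch of that cleaning step is shown here. The list of titles in title_pattern is an assumption made for illustration and would need to be adapted to the titles that actually occur in your data.

```r
incumbent_names <- c("Anakin Skywalker", "Sir Tony Blair",
                     "Barack Hussein Obama", "James Carter")

# Hypothetical list of leading titles; extend it to fit your data
title_pattern <- "^(Sir|Dame|Dr\\.?|Prof\\.?)\\s+"

# Drop a leading title, then trim any stray whitespace
clean_names <- trimws(gsub(title_pattern, "", incumbent_names))

clean_names
# stringr::str_remove(incumbent_names, title_pattern) is the stringr equivalent
```

The cleaned vector can then be written back to the name column of Incumbents before the join, so that "Sir Tony Blair" is compared as "Tony Blair".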

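Edit to reflect the addition:

You can extend the join to match on multiple columns. Keep in mind that stringdist_left_join applies the same method and max_dist to every column pair listed in by, so party is also compared by jw distance rather than strictly exactly; identical values have a distance of 0 and always match.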

Candidates %>% 
  stringdist_left_join(Incumbents, by = c("name" = "name", "party" = "party"), 
                       method = "jw",
                       max_dist = 0.2) %>%
  rename(candidate_name = name.x,
         candidate_party = party.x) %>% 
  mutate(incumbent = if_else(!is.na(name.y), 1, 0)) %>% 
  select(-ends_with(".y")) # drop the columns from the Incumbents table that are no longer needed

  candidate_name candidate_party incumbent
1   Barack Obama        Democrat         1
2 George W. Bush      Republican         0
3   Jimmy Carter        Democrat         1
4     Tony Blair        Democrat         1
5   Mickey Mouse      Republican         0
6    Darth Vader        Democrat         0
7     John Smith        Democrat         0
8     John Smith      Republican         1