Fuzzy matching two long character vectors in R
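I have two vectors: Candidates$names, containing the names of roughly 45,000 election candidates, and Incumbents$names, containing the names of roughly 7,600 members of parliament. I want to check whether each name in Candidates is present in Incumbents, and create a new dummy variable incumbent in Candidates that takes the value 1 if it is and 0 if it is not.
My problem is that names may differ slightly between the two lists. Sometimes a name includes a title, sometimes a middle name, and so on. Exact matching therefore does not work reliably, and I need a method that allows for some fuzziness.
I tried combining expand.grid(Candidates$names, Incumbents$names) with adist() as a proximity measure, then setting an arbitrary percentage (edit distance divided by name length) as the cut-off, but the resulting table is so long that R crashes on my computer, and the approach seems neither practical nor reliable enough.
Is there a better way to do this kind of fuzzy matching?
Edit: Here are some example vectors.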
Candidates <- data.frame(name = c("Barack Obama", "George W. Bush", "Jimmy Carter", "Tony Blair", "Mickey Mouse", "Darth Vader"), incumbent = NA)
Incumbents <- data.frame(name = c("Anakin Skywalker", "Sir Tony Blair", "Barack Hussein Obama", "James Carter"))
The resulting data frame should look like this:
Candidates <- data.frame(name = c("Barack Obama", "George W. Bush", "Jimmy Carter", "Tony Blair", "Mickey Mouse", "Darth Vader"), incumbent = c(1, 0, 1, 1, 0, 0))
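Edit #2: phiver's answer was very helpful, but I ran into the problem that some names occur more than once in my dataset. To identify them uniquely, I would like to use an additional variable, Candidates$party and Incumbents$party, in the matching. How can I include this additional variable, which must match exactly, in the code?
My modified example: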
Candidates <- data.frame(name = c("Barack Obama", "George W. Bush", "Jimmy Carter", "Tony Blair", "Mickey Mouse", "Darth Vader", "John Smith", "John Smith"), party = c("Democrat", "Republican", "Democrat", "Democrat", "Republican", "Democrat", "Democrat", "Republican") , incumbent = NA)
Incumbents <- data.frame(name = c("Anakin Skywalker", "Sir Tony Blair", "Barack Hussein Obama", "James Carter", "John Smith"), party = c("Republican", "Democrat", "Democrat", "Democrat", "Republican"))
With the Republican John Smith added as an incumbent, the result should follow the same pattern as before:
Candidates <- data.frame(name = c("Barack Obama", "George W. Bush", "Jimmy Carter", "Tony Blair", "Mickey Mouse", "Darth Vader", "John Smith", "John Smith"), party = c("Democrat", "Republican", "Democrat", "Democrat", "Republican", "Democrat", "Democrat", "Republican"), incumbent = c(1, 0, 1, 1, 0, 0, 0, 1))
With the fuzzyjoin package you can match the names as part of a join.
For the example given, this code reproduces the expected output.
library(fuzzyjoin)
library(dplyr)
Candidates %>%
  stringdist_left_join(Incumbents, by = c("name" = "name"), method = "jw", max_dist = 0.2) %>%
  rename(candidate_name = name.x) %>%
  mutate(incumbent = if_else(!is.na(name.y), 1, 0)) %>%
  select(-name.y)
  candidate_name incumbent
1   Barack Obama         1
2 George W. Bush         0
3   Jimmy Carter         1
4     Tony Blair         1
5   Mickey Mouse         0
6    Darth Vader         0
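The reason for using jw (Jaro-Winkler) is that it was developed for matching names that contain only a few errors. The distance lies between 0 and 1: the closer to 0, the more similar the names, and 0 means the names are identical. Choosing a good max_dist takes some tuning, and cleaning the names beforehand can help.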
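One way to tune max_dist is to look at the distances for a handful of known pairs before running the full join. The following is a minimal sketch, assuming the stringdist package (which fuzzyjoin relies on for its string distances) is installed; the lower-case vectors candidates and incumbents and the data frame closest are names made up for this illustration.

```r
library(stringdist)

# Small illustrative vectors taken from the example data
candidates <- c("Barack Obama", "Jimmy Carter", "Tony Blair", "Darth Vader")
incumbents <- c("Anakin Skywalker", "Sir Tony Blair",
                "Barack Hussein Obama", "James Carter")

# Matrix of "jw" distances: one row per candidate, one column per incumbent
d <- stringdistmatrix(candidates, incumbents, method = "jw")
dimnames(d) <- list(candidates, incumbents)

# For each candidate, the closest incumbent and the distance to it
closest <- data.frame(
  candidate  = candidates,
  best_match = incumbents[apply(d, 1, which.min)],
  distance   = round(apply(d, 1, min), 3),
  row.names  = NULL
)
print(closest)
```

Sorting closest by distance and reading down the list usually shows roughly where true matches stop and false ones begin, which gives a starting value for max_dist.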
To remove titles such as "Sir", you can apply a regular expression to the existing names with gsub, or use stringr::str_remove before the join.
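As a rough sketch of that cleaning step in base R: the list of titles below is a made-up assumption and would need to be extended for real data, and clean_name is a hypothetical helper, not part of any package.

```r
# Hypothetical list of titles to strip; extend it to suit your data
titles <- c("Sir", "Dr", "Prof")

# Match a title at the start of the string, an optional period,
# and the whitespace that follows it
pattern <- paste0("^(", paste(titles, collapse = "|"), ")\\.?[[:space:]]+")

# Remove a leading title and trim any stray whitespace
clean_name <- function(x) {
  trimws(gsub(pattern, "", x, ignore.case = TRUE))
}

clean_name(c("Sir Tony Blair", "Barack Hussein Obama"))
# returns "Tony Blair" "Barack Hussein Obama"
```

Applying clean_name to both Candidates$name and Incumbents$name before the join should let a tighter max_dist be used, since titles no longer add to the distance.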
Edit to reflect the addition to the question:
You can extend the join to match on multiple columns.
Candidates %>%
  # note: method and max_dist apply to every column in by, so party is also compared by string distance
  stringdist_left_join(Incumbents, by = c("name" = "name", "party" = "party"),
                       method = "jw",
                       max_dist = 0.2) %>%
  rename(candidate_name = name.x,
         candidate_party = party.x) %>%
  mutate(incumbent = if_else(!is.na(name.y), 1, 0)) %>%
  select(-ends_with(".y")) # drop the columns coming from the Incumbents table
  candidate_name candidate_party incumbent
1   Barack Obama        Democrat         1
2 George W. Bush      Republican         0
3   Jimmy Carter        Democrat         1
4     Tony Blair        Democrat         1
5   Mickey Mouse      Republican         0
6    Darth Vader        Democrat         0
7     John Smith        Democrat         0
8     John Smith      Republican         1
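Note that stringdist_left_join applies the same string distance and cut-off to every column listed in by, so party is matched fuzzily as well. That happens to work here because "Democrat" and "Republican" are far apart. If party must match exactly, one possible variant is fuzzy_left_join with one match function per column. This is a sketch under that assumption, not part of the original answer; name_close is a hypothetical helper and the data are a reduced version of the example.

```r
library(fuzzyjoin)
library(dplyr)
library(stringdist)

# Reduced version of the example data
Candidates <- data.frame(
  name  = c("Tony Blair", "John Smith", "John Smith"),
  party = c("Democrat", "Democrat", "Republican"),
  stringsAsFactors = FALSE
)
Incumbents <- data.frame(
  name  = c("Sir Tony Blair", "John Smith"),
  party = c("Democrat", "Republican"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when the "jw" distance is within the cut-off
name_close <- function(x, y) stringdist(x, y, method = "jw") <= 0.2

result <- Candidates %>%
  fuzzy_left_join(Incumbents,
                  by = c("name" = "name", "party" = "party"),
                  # fuzzy comparison for name, exact equality for party
                  match_fun = list(name_close, `==`)) %>%
  mutate(incumbent = if_else(!is.na(name.y), 1, 0)) %>%
  select(candidate_name = name.x, candidate_party = party.x, incumbent)

print(result)
```

The functions in match_fun are applied in the same order as the column pairs in by, so the first handles name and the second handles party.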