Test if words are in a string (grepl, fuzzyjoin?)

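I need to match and join two data frames when the strings from two columns of one data frame are both contained in the string of a column in the second data frame.

Example data frames: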

First <- c("john", "jane", "jimmy", "jerry", "matt", "tom", "peter", "leah")
Last  <- c("smith", "doe", "mcgee", "bishop", "gibbs", "dinnozo", "lane", "palmer")
Name  <- c("mr john smith","", "timothy t mcgee", "dinnozo tom", "jane  l doe", "jimmy mcgee", "leah elizabeth arthur palmer and co", "jerry bishop the cat")
ID    <- c("ID1", "ID2", "ID3", "ID4", "ID5", "ID6", "ID7", "ID8")

df1 <- data.frame(First, Last)
df2 <- data.frame(Name, ID)

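So basically, I have df1, where people's first and last names are neatly organized, and I have df2, where the name may be written as "First Last", "Last First", "First MI Last", or something else entirely that still contains the name. I need the ID column from df2. So I want to run code that checks whether df1$First and df1$Last both appear somewhere in the string in df2$Name and, if so, pulls df2$ID and joins it to df1.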

My R guru told me to use fuzzy_left_join from the fuzzyjoin package:

fzjoin <- fuzzy_left_join(df1, df2, by = c("First" = "Name"), match_fun = "contains")

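But it gives me an error saying that an argument is not logical, and I don't know how to rewrite it to do what I want. The documentation says that match_fun should be TRUE or FALSE, but I don't know what to do with that. Also, it only matches on df1$First rather than on both df1$First and df1$Last. I think I might be able to use grepl, but I'm not sure how, based on the examples I've seen. Any suggestions?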

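The documentation says that match_fun should be a "Vectorized function given two columns, returning TRUE or FALSE as to whether they are a match." It is not TRUE or FALSE itself; it is a function that returns TRUE or FALSE. If we switch the order of your data frames, we can use stringr::str_detect, which returns TRUE or FALSE as needed. The switch matters because str_detect(string, pattern) takes the string to search as its first argument, so the data frame holding Name has to come first: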

fuzzyjoin::fuzzy_left_join(
  df2, df1,
  by = c("Name" = "First", "Name" = "Last"),
  match_fun = stringr::str_detect
)
#                                  Name  ID First    Last
# 1                       mr john smith ID1  john   smith
# 2                                     ID2  <NA>    <NA>
# 3                     timothy t mcgee ID3  <NA>    <NA>
# 4                         dinnozo tom ID4   tom dinnozo
# 5                         jane  l doe ID5  jane     doe
# 6                         jimmy mcgee ID6 jimmy   mcgee
# 7 leah elizabeth arthur palmer and co ID7  leah  palmer
# 8                jerry bishop the cat ID8 jerry  bishop
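
Since the question also asks about grepl, here is a minimal base R sketch of the same idea, with no extra packages. It is not the fuzzyjoin approach above, and it makes assumptions that are not in the original post: contains_word is a hypothetical helper name, it matches whole space-delimited words rather than arbitrary substrings (so "tom" would not match "tomas"), and only the first matching df2 row is kept for each person. It keeps the rows of df1, which is the direction the question asked for.

```r
# Same example data as in the question
First <- c("john", "jane", "jimmy", "jerry", "matt", "tom", "peter", "leah")
Last  <- c("smith", "doe", "mcgee", "bishop", "gibbs", "dinnozo", "lane", "palmer")
Name  <- c("mr john smith", "", "timothy t mcgee", "dinnozo tom", "jane  l doe",
           "jimmy mcgee", "leah elizabeth arthur palmer and co", "jerry bishop the cat")
ID    <- c("ID1", "ID2", "ID3", "ID4", "ID5", "ID6", "ID7", "ID8")
df1 <- data.frame(First, Last)
df2 <- data.frame(Name, ID)

# Hypothetical helper: TRUE where `word` (a single string) occurs in `text`
# as a whole space-delimited word. fixed = TRUE turns off regex interpretation.
contains_word <- function(word, text) {
  grepl(paste0(" ", word, " "), paste0(" ", text, " "), fixed = TRUE)
}

# For each row of df1, find the first df2 row whose Name contains both
# the first and the last name. which(hit)[1] is NA when nothing matches.
match_row <- vapply(seq_len(nrow(df1)), function(i) {
  hit <- contains_word(df1$First[i], df2$Name) &
         contains_word(df1$Last[i],  df2$Name)
  which(hit)[1]
}, integer(1))

# Indexing with NA yields NA, so unmatched people get a missing ID
df1$ID <- df2$ID[match_row]
df1
#   First    Last   ID
# 1  john   smith  ID1
# 2  jane     doe  ID5
# 3 jimmy   mcgee  ID6
# 4 jerry  bishop  ID8
# 5  matt   gibbs <NA>
# 6   tom dinnozo  ID4
# 7 peter    lane <NA>
# 8  leah  palmer  ID7
```

grepl only uses the first element of its pattern argument, which is why the sketch loops over the rows of df1 instead of passing whole columns. If plain substring matching is what you want, drop the space padding inside contains_word.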