
Test if words are in a string (grepl, fuzzyjoin?)



First <- c("john", "jane", "jimmy", "jerry", "matt", "tom", "peter", "leah")
Last  <- c("smith", "doe", "mcgee", "bishop", "gibbs", "dinnozo", "lane", "palmer")
Name  <- c("mr john smith","", "timothy t mcgee", "dinnozo tom", "jane  l doe", "jimmy mcgee", "leah elizabeth arthur palmer and co", "jerry bishop the cat")
ID    <- c("ID1", "ID2", "ID3", "ID4", "ID5", "ID6", "ID7", "ID8")

df1 <- data.frame(First, Last)
df2 <- data.frame(Name, ID)

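So basically, I have df1, which has people's first and last names fairly well organized, and df2, whose names can be arranged as "First, Last" or "Last First" or "First MI Last" or something else entirely that still contains the name. I need the ID column from df2. So I want to run some code to check whether df1$First and df1$Last both appear somewhere in the df2$Name string and, if they do, pull df2$ID and join it to df1.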

My R guru tells me to use fuzzy_left_join from the fuzzyjoin package:

fzjoin <- fuzzy_left_join(df1, df2, by = c("First" = "Name"), match_fun = "contains")

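But it gives me an error that an argument is not logical, and I can't figure out how to rewrite it to do what I want. The documentation says that match_fun should be TRUE or FALSE, but I don't know what to do with that. Also, it only matches on df1$First rather than on both df1$First and df1$Last. I think I might be able to use grepl, but I'm not sure how based on the examples I've seen. Any suggestions?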

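The documentation says that match_fun should be a "Vectorized function given two columns, returning TRUE or FALSE as to whether they are a match." So it is not TRUE or FALSE itself; it is a function that returns TRUE or FALSE. If we switch your order (df2 first, so that Name is the string being searched), we can use stringr::str_detect, which returns TRUE or FALSE as needed.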
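To make that contract concrete, here is a small sketch of what such a function looks like when called on its own. The vectors `names_x` and `firsts_y` are made up for illustration and are not part of the question's data. It shows that `stringr::str_detect` takes two paired vectors and returns one logical per pair, and why the argument order matters: the first argument is the string being searched, the second is the pattern.

```r
library(stringr)

# Hypothetical vectors for illustration, compared element by element
names_x  <- c("mr john smith", "dinnozo tom", "timothy t mcgee")
firsts_y <- c("john", "tom", "jimmy")

# String first, pattern second: is each first name found in its paired full name?
str_detect(names_x, firsts_y)
# [1]  TRUE  TRUE FALSE

# Reversed order asks whether the full name occurs inside the first name
str_detect(firsts_y, names_x)
# [1] FALSE FALSE FALSE
```

Note that `str_detect` treats the second argument as a regular expression by default, so names containing characters such as `.` or `(` would need extra care.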

fuzzyjoin::fuzzy_left_join(
  df2, df1,
  by = c("Name" = "First", "Name" = "Last"),
  match_fun = stringr::str_detect
)
#                                  Name  ID First    Last
# 1                       mr john smith ID1  john   smith
# 2                                     ID2  <NA>    <NA>
# 3                     timothy t mcgee ID3  <NA>    <NA>
# 4                         dinnozo tom ID4   tom dinnozo
# 5                         jane  l doe ID5  jane     doe
# 6                         jimmy mcgee ID6 jimmy   mcgee
# 7 leah elizabeth arthur palmer and co ID7  leah  palmer
# 8                jerry bishop the cat ID8 jerry  bishop
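Since the question also asks about grepl, here is a minimal base R sketch of the same idea that keeps df1 as the left table and attaches the ID to it. The helper name `find_id` is hypothetical, and the sketch assumes that taking the first matching row of df2 is acceptable when more than one Name contains both words.

```r
First <- c("john", "jane", "jimmy", "jerry", "matt", "tom", "peter", "leah")
Last  <- c("smith", "doe", "mcgee", "bishop", "gibbs", "dinnozo", "lane", "palmer")
Name  <- c("mr john smith", "", "timothy t mcgee", "dinnozo tom", "jane  l doe",
           "jimmy mcgee", "leah elizabeth arthur palmer and co", "jerry bishop the cat")
ID    <- c("ID1", "ID2", "ID3", "ID4", "ID5", "ID6", "ID7", "ID8")

df1 <- data.frame(First, Last, stringsAsFactors = FALSE)
df2 <- data.frame(Name, ID, stringsAsFactors = FALSE)

# Hypothetical helper: return the ID of the first df2 row whose Name contains
# both the first and the last name, or NA if no row does.
# fixed = TRUE treats the names as literal text rather than regular expressions.
find_id <- function(first, last) {
  hit <- grepl(first, df2$Name, fixed = TRUE) &
         grepl(last,  df2$Name, fixed = TRUE)
  if (any(hit)) df2$ID[which(hit)[1]] else NA_character_
}

# Apply the helper to each (First, Last) pair of df1
df1$ID <- mapply(find_id, df1$First, df1$Last, USE.NAMES = FALSE)
df1
```

With this data, "matt gibbs" and "peter lane" get NA because no Name contains both words. Like the str_detect version, this is plain substring matching, so a short name such as "tom" would also match inside a longer word.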