Test if words are in a string (grepl, fuzzyjoin?)
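I need to match and join two data frames when the strings from two columns of one data frame are contained in the strings of one column of a second data frame.
Example data frames: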
First <- c("john", "jane", "jimmy", "jerry", "matt", "tom", "peter", "leah")
Last <- c("smith", "doe", "mcgee", "bishop", "gibbs", "dinnozo", "lane", "palmer")
Name <- c("mr john smith","", "timothy t mcgee", "dinnozo tom", "jane l doe", "jimmy mcgee", "leah elizabeth arthur palmer and co", "jerry bishop the cat")
ID <- c("ID1", "ID2", "ID3", "ID4", "ID5", "ID6", "ID7", "ID8")
df1 <- data.frame(First, Last)
df2 <- data.frame(Name, ID)
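So basically, I have df1, which holds people's first and last names in a fairly organized fashion, and I have df2, whose names could be organized as "First Last", or "Last First", or "First MI Last", or something else entirely that still contains the name. I need the ID column from df2. So I want to run code that checks whether df1$First and df1$Last both appear somewhere in the string in df2$Name, and if they do, pulls df2$ID and joins it to df1.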
My R guru told me to use fuzzy_left_join from the fuzzyjoin package:
fzjoin <- fuzzy_left_join(df1, df2, by = c("First" = "Name"), match_fun = "contains")
But it gives me an error saying an argument is not logical, and I can't figure out how to rewrite it to do what I want. The documentation says that match_fun should be TRUE or FALSE, but I don't know what to do with that. Also, it only matches on df1$First rather than on both df1$First and df1$Last. I think I might be able to use grepl, but I'm not sure how, based on the examples I've seen. Any advice?
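The documentation says match_fun should be a "Vectorized function given two columns, returning TRUE or FALSE as to whether they are a match." It is not TRUE or FALSE itself; it is a function that returns TRUE or FALSE. If we switch your order, we can use stringr::str_detect, which does return TRUE or FALSE as needed. The order matters because str_detect takes the string to search first and the pattern second, so the table holding the long Name strings has to come first.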
fuzzyjoin::fuzzy_left_join(
df2, df1,
by = c("Name" = "First", "Name" = "Last"),
match_fun = stringr::str_detect
)
#                                  Name  ID First    Last
# 1                       mr john smith ID1  john   smith
# 2                                     ID2  <NA>    <NA>
# 3                     timothy t mcgee ID3  <NA>    <NA>
# 4                         dinnozo tom ID4   tom dinnozo
# 5                          jane l doe ID5  jane     doe
# 6                         jimmy mcgee ID6 jimmy   mcgee
# 7 leah elizabeth arthur palmer and co ID7  leah  palmer
# 8                jerry bishop the cat ID8 jerry  bishop
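Since the question also asks about grepl, here is a minimal base-R sketch of the same idea that attaches the ID to df1 instead. It is one possible approach, not the only one, and it rests on a few assumptions: find_id is a hypothetical helper name, matching is done on literal substrings (fixed = TRUE), and if more than one row of df2 matches, only the first is kept.

```r
# Sample data from the question
First <- c("john", "jane", "jimmy", "jerry", "matt", "tom", "peter", "leah")
Last <- c("smith", "doe", "mcgee", "bishop", "gibbs", "dinnozo", "lane", "palmer")
Name <- c("mr john smith", "", "timothy t mcgee", "dinnozo tom", "jane l doe",
          "jimmy mcgee", "leah elizabeth arthur palmer and co", "jerry bishop the cat")
ID <- c("ID1", "ID2", "ID3", "ID4", "ID5", "ID6", "ID7", "ID8")
df1 <- data.frame(First, Last, stringsAsFactors = FALSE)
df2 <- data.frame(Name, ID, stringsAsFactors = FALSE)

# Hypothetical helper: ID of the first df2 row whose Name contains both
# the first and the last name as literal substrings, or NA if none does.
find_id <- function(first, last) {
  hit <- grepl(first, df2$Name, fixed = TRUE) &
    grepl(last, df2$Name, fixed = TRUE)
  if (any(hit)) df2$ID[which(hit)[1]] else NA_character_
}

# Apply the helper to each (First, Last) pair and store the result in df1
df1$ID <- unname(mapply(find_id, df1$First, df1$Last))
df1
```

Note that this is plain substring matching, so a short name such as "tom" would also match inside a longer word like "tomas". If that is a concern in your real data, a regex with word boundaries (and fixed = FALSE) may be a better fit.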