Joining two dataframes on a condition (grepl)
I want to join two data frames on a condition, in this case that one string is contained in another. Say I have two data frames,
df1 <- data.frame(fullnames=c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"),
ages = c(30, 51, 45, 38, 20))
       fullnames ages
1       Jane Doe   30
2 Mr. John Smith   51
3 Nate Cox, Esq.   45
4   Bill Lee III   38
5 Ms. Kate Smith   20
df2 <- data.frame(lastnames=c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"),
ages=c(30, 45, 20, 28, 51, 38),
homestate=c("NJ", "CT", "MA", "RI", "MA", "NY"))
  lastnames ages homestate
1       Doe   30        NJ
2       Cox   45        CT
3     Smith   20        MA
4      Jung   28        RI
5     Smith   51        MA
6       Lee   38        NY
I want to left join these two data frames on ages and on rows where df2$lastnames is contained in df1$fullnames. I thought fuzzy_join could do it, but I don't think it liked my grepl:
joined_dfs <- fuzzy_join(df1, df2, by = c("ages", "fullnames"="lastnames"),
+ match_fun = c("=", "grepl()"),
+ mode="left")
Error in which(m) : argument to 'which' is not logical
Desired result: a data frame identical to the first but with a "homestate" column appended. Any ideas?
TLDR
You just need to fix match_fun:
# ...
match_fun = list(`==`, stringr::str_detect),
# ...
Background
You had the right idea, but you misinterpreted the match_fun parameter of fuzzyjoin::fuzzy_join(). Per the documentation, match_fun should be a
Vectorized function given two columns, returning TRUE or FALSE as to whether they are a match. Can be a list of functions one for each pair of columns specified in by (if a named list, it uses the names in x). If only one function is given it is used on all column pairs.
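Solution
A simple correction will do the trick, with further formatting by dplyr. For conceptual clarity, I have typographically aligned the by columns with the functions used to match them: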
library(dplyr)

# ...
# Existing code
# ...

joined_dfs <- fuzzy_join(
  df1, df2,
  by        = c("ages", "fullnames" = "lastnames"),
  #             |----|  |-----------------------|
  match_fun = list(`==`  , stringr::str_detect    ),
  #                |--|    |-----------------|
  # Match by equality ^     ^ Match by detection of `lastnames` in `fullnames`
  mode = "left"
) %>%
  # Format resulting dataset as you requested.
  select(fullnames, ages = ages.x, homestate)
Result
Given your sample data reproduced here
df1 <- data.frame(
fullnames = c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"),
ages = c(30, 51, 45, 38, 20)
)
df2 <- data.frame(
lastnames = c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"),
ages = c(30, 45, 20, 28, 51, 38),
homestate = c("NJ", "CT", "MA", "RI", "MA", "NY")
)
this solution should produce the following data.frame for joined_dfs, formatted as requested:
       fullnames ages homestate
1       Jane Doe   30        NJ
2 Mr. John Smith   51        MA
3 Nate Cox, Esq.   45        CT
4   Bill Lee III   38        NY
5 Ms. Kate Smith   20        MA
Note
Because each ages value happens to be a unique key, the following join on only *names
fuzzy_join(
df1, df2,
by = c("fullnames" = "lastnames"),
match_fun = stringr::str_detect,
mode = "left"
)
will better illustrate the behavior of matching on substrings:
       fullnames ages.x lastnames ages.y homestate
1       Jane Doe     30       Doe     30        NJ
2 Mr. John Smith     51     Smith     20        MA
3 Mr. John Smith     51     Smith     51        MA
4 Nate Cox, Esq.     45       Cox     45        CT
5   Bill Lee III     38       Lee     38        NY
6 Ms. Kate Smith     20     Smith     20        MA
7 Ms. Kate Smith     20     Smith     51        MA
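For intuition about what a join on a condition involves, here is a minimal base R sketch. It is not how fuzzyjoin is implemented; it simply pairs every row of df1 with every row of df2 and keeps the pairs that pass both tests. The names pairs, name_hit, keep, and matched are illustrative, and fixed = TRUE (literal rather than regex matching) is my own assumption. Because it keeps only matched pairs it behaves like an inner join; a true left join would also have to add back any unmatched rows of df1.

```r
df1 <- data.frame(
  fullnames = c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"),
  ages      = c(30, 51, 45, 38, 20)
)
df2 <- data.frame(
  lastnames = c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"),
  ages      = c(30, 45, 20, 28, 51, 38),
  homestate = c("NJ", "CT", "MA", "RI", "MA", "NY")
)

# Cartesian product: by = NULL pairs every row of df1 with every row of df2.
# The shared column name `ages` becomes ages.x (from df1) and ages.y (from df2).
pairs <- merge(df1, df2, by = NULL)

# Test each pair: is lastnames found literally inside fullnames?
name_hit <- mapply(grepl, pairs$lastnames, pairs$fullnames,
                   MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

# Keep pairs where the ages are equal AND the name test passed.
keep <- pairs$ages.x == pairs$ages.y & name_hit

matched <- pairs[keep, c("fullnames", "ages.x", "homestate")]
names(matched)[2] <- "ages"
matched
```

With the sample data this keeps one row per person in df1; the row order follows the cross join rather than df1.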
Where You Went Wrong
Error in Type
The value passed to match_fun should be either (the symbol for) a function
fuzzyjoin::fuzzy_join(
# ...
match_fun = grepl
# ...
)
or a list of such (symbols for) functions:
fuzzyjoin::fuzzy_join(
# ...
match_fun = list(`=`, grepl)
# ...
)
Instead of providing a list of symbols
match_fun = list(`=`, grepl)
you incorrectly provided a vector of character strings:
match_fun = c("=", "grepl()")
Error in Syntax
You should have named the functions
`=`
grepl
yet you incorrectly attempted to call them:
=
grepl()
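Naming them passes the functions themselves to match_fun, as intended, whereas calling them passes their return values*. In R, an operator like = is named using backticks: `=`.
* Assuming the calls didn't fail with errors. Here, they would fail.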
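The distinction between text, a function, and a function's name can be checked in a few lines of base R. This is only an illustrative sketch; the variable names supplied and expected are made up for the example.

```r
# What was supplied: a character vector. Its elements are text, not functions.
supplied <- c("=", "grepl()")
is.function(supplied[[1]])   # FALSE
is.function(supplied[[2]])   # FALSE

# What match_fun expects: functions, referred to by name.
# Operators need backticks so they can be named without being applied.
expected <- list(`==`, grepl)
vapply(expected, is.function, logical(1))   # TRUE TRUE

# A function passed by name can be called later by whoever receives it.
expected[[1]](c(30, 51), c(30, 45))   # TRUE FALSE
```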
Inappropriate Functions
To compare two values for equality, here the numeric vectors df1$ages and df2$ages, you should use the relational operator ==; yet you incorrectly supplied the assignment operator =.
Furthermore, grepl() is not vectorized in quite the way match_fun desires. While its second argument (x) is indeed a vector
a character vector where matches are sought, or an object which can be coerced by as.character to a character vector. Long vectors are supported.
its first argument (pattern) is (treated as) a single character string:
character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except for regexpr, gregexpr and regexec.
Thus, grepl() is not a
Vectorized function given two columns...
but rather a function given one string (scalar) and one column (vector) of strings.
The answer to your prayers is not grepl() but rather stringr::str_detect(), which is
Vectorised over string and pattern. Equivalent to grepl(pattern, x).
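If you would rather stay in base R, a pairwise wrapper around grepl() can play a similar role. The sketch below is an illustration under stated assumptions, not part of fuzzyjoin or stringr: detect_pairwise is a hypothetical helper, it assumes both inputs have the same length, it mirrors the (string, pattern) argument order of str_detect, and it uses fixed = TRUE so each pattern is matched literally.

```r
# Hypothetical helper: compare element i of `string` with element i of `pattern`.
detect_pairwise <- function(string, pattern) {
  stopifnot(length(string) == length(pattern))
  vapply(
    seq_along(string),
    # grepl() takes the pattern first, so the arguments are swapped here.
    function(i) grepl(pattern[[i]], string[[i]], fixed = TRUE),
    logical(1)
  )
}

fullnames <- c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.")
lastnames <- c("Doe", "Cox", "Cox")

detect_pairwise(fullnames, lastnames)
# TRUE FALSE TRUE
```

Unlike a bare grepl(), which would use only the first pattern, this returns one logical value per pair of elements.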
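Note
Since you are simply trying to detect whether a literal string in df1$fullnames contains a literal string in df2$lastnames, you don't want to accidentally treat the strings in df2$lastnames as regular expression patterns. Now, your df2$lastnames column is statistically unlikely to contain names with special regex characters; the lone exception is -, which is interpreted literally outside of [], and [] are very unlikely to be found in a name.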
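If you are still worried about accidental regex, you might want to consider alternative search methods with stringi::stri_detect_fixed() or stringi::stri_detect_coll(). These perform literal matching, respectively by either byte or "canonical equivalence"; the latter adjusts for locale and special characters, in keeping with natural language processing.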
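To see the difference the matching mode makes, here is a small base R sketch using a made-up last name that contains a period. It uses grepl() with fixed = TRUE as a stand-in for literal matching; the stringi functions are not called here.

```r
pattern <- "St. John"   # hypothetical last name containing a regex metacharacter

# As a regular expression, "." matches any single character.
grepl(pattern, "Stu John")                      # TRUE: "." matched the "u"

# As a literal string, "." only matches a period.
grepl(pattern, "Stu John", fixed = TRUE)        # FALSE
grepl(pattern, "Mary St. John", fixed = TRUE)   # TRUE
```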
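Given your two data frames, this seems to work:
EDITED following @Greg's comment:
The code is adapted to the data as posted; if your actual data contain more variants, especially in the last names, e.g. not only III but also IV, feel free to adapt the code accordingly: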
library(dplyr)
df1 %>%
  mutate(
    # create new column that gets rid of strings after last name:
    lastnames = sub("\\sI{1,3}$|,.+$", "", fullnames),
    # grab last names:
    lastnames = sub(".*?(\\w+)$", "\\1", lastnames)) %>%
  # join the two dataframes:
  left_join(., df2, by = c("lastnames", "ages"))
       fullnames ages lastnames homestate
1       Jane Doe   30       Doe        NJ
2 Mr. John Smith   51     Smith        MA
3 Nate Cox, Esq.   45       Cox        CT
4   Bill Lee III   38       Lee        NY
5 Ms. Kate Smith   20     Smith        MA
If you want lastnames removed, just append this after a %>%:
select(-lastnames)
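To see what the two sub() calls do, and how the first pattern might be extended, here is a standalone sketch. The extra name "Ann Ray IV" and the widened suffix pattern are my own illustrative assumptions, and perl = TRUE is added so that the non-capturing group and the lazy quantifier follow PCRE rules.

```r
fullnames <- c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.",
               "Bill Lee III", "Ann Ray IV")

# Step 1: strip a trailing roman-numeral suffix (I to III, or IV),
# or everything from a comma onward.
trimmed <- sub("\\s(?:I{1,3}|IV)$|,.+$", "", fullnames, perl = TRUE)

# Step 2: keep only the final word, which is taken to be the last name.
lastnames <- sub(".*?(\\w+)$", "\\1", trimmed, perl = TRUE)
lastnames
# "Doe" "Smith" "Cox" "Lee" "Ray"
```

Any suffix the first pattern does not anticipate would survive step 1 and be picked up as the "last name" in step 2, so the pattern needs to cover the variants actually present in the data.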
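EDIT #2:
If you don't trust the above solution because the last names are actually recorded in widely varying ways, then fuzzy_join is of course an option too. BUT the current fuzzy_join solution is not sufficient; it needs to be amended by one critical data transformation. This is because str_detect detects whether one string is contained in another. That is, it will return TRUE when Smith is compared with Smithsonian or Hammer-Smith, since each time the string Smith is indeed contained in the longer name. If, as is likely to happen in a large dataset, Smith and Smithsonian happen to have the same ages, the mismatch will be perfect: fuzzy_join will incorrectly join the two. The same problem arises if, for example, Smith and Smith-Klein have the same age: fuzzy_join will join them too.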
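The first set of problematic cases (such as Smithsonian) can be resolved by including word-boundary anchors \b in df2. These assert that, for example, Smith must be bounded by word boundaries on both sides, which is not the case for Smithsonian: it does have an invisible boundary on its left, but its right-hand boundary comes only after its last letter n. The second set of problematic cases can be resolved by including a negative lookahead after \b, namely \b(?!-), which asserts that a hyphen must not follow the word boundary.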
The solution can be implemented easily with mutate and paste0, like so:
fuzzy_join(
  df1, df2 %>%
    # wrap each last name in word boundaries (backslashes must be doubled in R strings)
    mutate(lastnames = paste0("\\b", lastnames, "\\b(?!-)")),
  by = c("ages", "fullnames" = "lastnames"),
  match_fun = list(`==`, stringr::str_detect),
  mode = "left"
) %>%
  select(fullnames, ages = ages.x, homestate)
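The effect of the added anchors can be checked without running the join. The sketch below uses base grepl() with perl = TRUE as a stand-in for str_detect (an assumption on my part; both engines accept \b and lookaheads), applied to a handful of made-up names.

```r
# Same construction as in the join: word boundaries plus "no hyphen may follow".
pattern <- paste0("\\b", "Smith", "\\b(?!-)")
candidates <- c("Mr. John Smith", "Smithsonian", "Smith-Klein", "Hammer-Smith")

hits <- grepl(pattern, candidates, perl = TRUE)
setNames(hits, candidates)
# Mr. John Smith: TRUE, Smithsonian: FALSE, Smith-Klein: FALSE, Hammer-Smith: TRUE
```

Note that Hammer-Smith still matches: a hyphen counts as a word boundary, and the lookahead only guards the right-hand side. Excluding such names as well would take something like a negative lookbehind, (?<!-), in front of the pattern.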