Joining two dataframes on a condition (grepl)

I want to join two data frames on a condition, in this case that one string is contained in another. Say I have two data frames,

df1 <- data.frame(fullnames=c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"), 
                  ages = c(30, 51, 45, 38, 20))

       fullnames ages
1       Jane Doe   30
2 Mr. John Smith   51
3 Nate Cox, Esq.   45
4   Bill Lee III   38
5 Ms. Kate Smith   20

df2 <- data.frame(lastnames=c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"), 
                  ages=c(30, 45, 20, 28, 51, 38), 
                  homestate=c("NJ", "CT", "MA", "RI", "MA", "NY"))
  lastnames ages homestate
1       Doe   30        NJ
2       Cox   45        CT
3     Smith   20        MA
4      Jung   28        RI
5     Smith   51        MA
6       Lee   38        NY

I want to left join these two data frames on ages and on the rows where df2$lastnames is contained in df1$fullnames. I thought fuzzy_join could do it, but I don't think it likes my grepl:

joined_dfs <- fuzzy_join(df1, df2, by = c("ages", "fullnames"="lastnames"), 
+                          match_fun = c("=", "grepl()"),
+                          mode="left")
Error in which(m) : argument to 'which' is not logical

Desired result: a data frame identical to the first but with a "homestate" column appended. Any ideas?

TLDR

You only need to fix match_fun:

# ...
match_fun = list(`==`, stringr::str_detect),
# ...

Background

You had the right idea, but you misinterpreted the match_fun argument of fuzzyjoin::fuzzy_join(). Per the documentation, match_fun should be a

Vectorized function given two columns, returning TRUE or FALSE as to whether they are a match. Can be a list of functions one for each pair of columns specified in by (if a named list, it uses the names in x). If only one function is given it is used on all column pairs.
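As a rough mental model of what such a match function is asked to do, here is a toy conditional join in base R. It is a simplified sketch and not fuzzyjoin's actual implementation; the trimmed-down data frames and the names pairs, same_age and contains are made up for illustration.

```r
# Toy model of a conditional join in base R (NOT fuzzyjoin's real code).
df1 <- data.frame(fullnames = c("Jane Doe", "Ms. Kate Smith"), ages = c(30, 20))
df2 <- data.frame(lastnames = c("Doe", "Smith", "Smith"), ages = c(30, 20, 51))

# Every candidate pair (row i of df1, row j of df2).
pairs <- expand.grid(i = seq_len(nrow(df1)), j = seq_len(nrow(df2)))

# Each predicate receives two equal-length vectors and returns one
# TRUE/FALSE per candidate pair.
same_age <- df1$ages[pairs$i] == df2$ages[pairs$j]
contains <- mapply(grepl, df2$lastnames[pairs$j], df1$fullnames[pairs$i],
                   MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

# A pair is joined only if every predicate agrees.
matched <- pairs[same_age & contains, ]
print(matched)
```

The point is only that each predicate must return one logical value per pair of elements, which is what "vectorized function given two columns" asks for.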

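Solution

With some further formatting via dplyr, a simple correction will do the trick. For conceptual clarity, I have typographically aligned each by column with the function used to match it: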

library(dplyr)

# ...
# Existing code
# ...

joined_dfs <- fuzzy_join(
  df1, df2,

  by        =       c("ages", "fullnames" = "lastnames"),
  #                   |----|  |-----------------------|
  match_fun =    list(`==`  , stringr::str_detect      ),
  #                   |--|    |-----------------|
  #   Match by equality ^      ^ Match by detection of `lastnames` in `fullnames`    

  mode = "left"
) %>%
  # Format resulting dataset as you requested.
  select(fullnames, ages = ages.x, homestate)

Result

Given the sample data you reproduced here

df1 <- data.frame(
  fullnames = c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"),
  ages = c(30, 51, 45, 38, 20)
)

df2 <- data.frame(
  lastnames = c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"),
  ages = c(30, 45, 20, 28, 51, 38),
  homestate = c("NJ", "CT", "MA", "RI", "MA", "NY")
)

this solution should produce the following data.frame for joined_dfs, formatted as requested:

       fullnames ages homestate
1       Jane Doe   30        NJ
2 Mr. John Smith   51        MA
3 Nate Cox, Esq.   45        CT
4   Bill Lee III   38        NY
5 Ms. Kate Smith   20        MA

Note

Because each ages value happens to be a unique key, the following join on only the *names columns

fuzzy_join(
  df1, df2,
  by = c("fullnames" = "lastnames"),
  match_fun = stringr::str_detect,
  mode = "left"
)

better illustrates the behavior of matching on substrings:

       fullnames ages.x lastnames ages.y homestate
1       Jane Doe     30       Doe     30        NJ
2 Mr. John Smith     51     Smith     20        MA
3 Mr. John Smith     51     Smith     51        MA
4 Nate Cox, Esq.     45       Cox     45        CT
5   Bill Lee III     38       Lee     38        NY
6 Ms. Kate Smith     20     Smith     20        MA
7 Ms. Kate Smith     20     Smith     51        MA

Where You Went Wrong

Error in Type

The value passed to match_fun should be either (the symbol for) a function

fuzzyjoin::fuzzy_join(
  # ...
  match_fun = grepl
  # ...
)

or a list of such (symbols for) functions:

fuzzyjoin::fuzzy_join(
  # ...
  match_fun = list(`=`, grepl)
  # ...
)

Instead of providing a list of symbols

match_fun = list(`=`, grepl)

you incorrectly provided a vector of character strings:

match_fun = c("=", "grepl()")

Error in Syntax

You should name the functions

`=`
grepl

yet you incorrectly attempted to call them:

=
grepl()

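Naming them passes the functions themselves to match_fun, as intended, whereas calling them would pass their return values*. In R, an operator like = is named using backticks: `=`.

* Assuming the calls did not fail with an error. Here, they would fail.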
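The difference between naming a function and passing a string that merely looks like one can be checked directly in base R. This is a small illustrative sketch; the variable names eq, bad and funs are made up.

```r
# A backtick-quoted operator name evaluates to the function itself.
eq <- `==`
print(is.function(eq))            # a function object
print(eq(c(30, 51), c(30, 20)))   # element-wise comparison

# The original attempt passed character strings, not functions.
bad <- c("=", "grepl()")
print(is.function(bad))
print(class(bad))

# A list can hold several functions, one per pair of `by` columns.
funs <- list(`==`, grepl)
print(vapply(funs, is.function, logical(1)))
```

Note that c() cannot hold functions as a plain vector, which is why a list() is needed when more than one match function is supplied.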

Inappropriate Functions

To compare two values for equality, here the numeric vectors df1$ages and df2$ages, you should use the relational operator ==; yet you incorrectly supplied the assignment operator =.

Furthermore, grepl() is not vectorized in quite the way match_fun requires. While its second argument (x) is indeed a vector

a character vector where matches are sought, or an object which can be coerced by as.character to a character vector. Long vectors are supported.

its first argument (pattern) is (treated as) a single character string:

character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except for regexpr, gregexpr and regexec.

Thus, grepl() is not a

Vectorized function given two columns...

but rather a function given one string (scalar) and one column (vector) of strings.
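To make the distinction concrete, here is a minimal base-R sketch of a predicate that is vectorized over both arguments, built by calling grepl() once per pair. The helper name detect_pairwise is hypothetical, and it assumes literal (fixed = TRUE) matching; it is only meant to show the shape of function that match_fun expects.

```r
# Hypothetical helper: element-wise "does x[i] contain pattern[i]?".
# grepl() is called once per pair, so each call gets a single pattern.
detect_pairwise <- function(x, pattern) {
  unname(mapply(function(s, p) grepl(p, s, fixed = TRUE), x, pattern))
}

fullnames <- c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.")
lastnames <- c("Doe", "Cox", "Cox")

# One logical result per (fullname, lastname) pair.
hits <- detect_pairwise(fullnames, lastnames)
print(hits)
```

The argument order (string first, pattern second) mirrors stringr::str_detect(), so a helper of this shape could stand in for it where installing stringr is not an option.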

The answer to your prayers is not grepl() but rather stringr::str_detect(), which is

Vectorised over string and pattern. Equivalent to grepl(pattern, x).

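and wraps stringi::stri_detect().

Note

Since you are simply trying to detect whether a literal string in df1$fullnames contains a literal string in df2$lastnames, you do not want to accidentally treat the strings in df2$lastnames as regular expression patterns. As it stands, your df2$lastnames column is statistically unlikely to contain names with special regex characters; the lone exception is -, which is interpreted literally outside of [], and brackets are very unlikely to be found in a name.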

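If you are still worried about accidental regular expressions, you might want to consider alternative search methods with stringi::stri_detect_fixed() or stringi::stri_detect_coll(). These perform literal matching, respectively by either byte or "canonical equivalence"; the latter adjusts for locale and special characters, in keeping with natural language processing.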
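A short sketch of the difference, assuming the stringi package is installed. The names and the pattern "D.e" are made up to contain a regex metacharacter; they are not part of the posted data.

```r
library(stringi)

# Made-up examples: the pattern contains ".", a regex metacharacter.
fullnames <- c("Jane Doe", "Jane D.e")
pattern   <- c("D.e", "D.e")

# As a regular expression, "." matches any character, so both names match.
regex_hits <- stri_detect_regex(fullnames, pattern)

# As a fixed string, only a literal "D.e" counts as a match.
fixed_hits <- stri_detect_fixed(fullnames, pattern)

print(regex_hits)
print(fixed_hits)
```

Both functions take the string first and the pattern second and are vectorized over both, so either has the shape match_fun expects.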

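Given your two data frames, this seems to work:

EDITED per @Greg's comment:

The code has been adapted to the posted data; if your actual data has more variation, especially in the last names, for example not only III but also IV, feel free to adjust the code accordingly: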

library(dplyr)
df1 %>%
  mutate(
    # create new column that gets rid of strings after last name:
    lastnames = sub("\\sI{1,3}$|,.+$", "", fullnames),
    # grab last names:
    lastnames = sub(".*?(\\w+)$", "\\1", lastnames)) %>%
  # join the two dataframes:
  left_join(., df2, by = c("lastnames", "ages"))
       fullnames ages lastnames homestate
1       Jane Doe   30       Doe        NJ
2 Mr. John Smith   51     Smith        MA
3 Nate Cox, Esq.   45       Cox        CT
4   Bill Lee III   38       Lee        NY
5 Ms. Kate Smith   20     Smith        MA

If you want the lastnames column removed, just append this after a %>%:

select(-lastnames)

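EDIT #2:

If you are not convinced by the above solution because there is wide variation in how last names are actually recorded, then fuzzy_join is of course an option too. BUT the current fuzzy_join solution is not sufficient; it needs to be amended by one critical data transformation. This is because str_detect detects whether a string is contained in another string. That is, it will return TRUE if Smith is compared with Smithsonian or Hammer-Smith, since each time the string Smith is indeed contained in the longer name. If, as can well happen in a large data set, Smith and Smithsonian happen to have the same ages, the mismatch will be perfect: fuzzy_join will incorrectly join the two. The same problem arises if, for example, Smith and Smith-Klein have the same age: fuzzy_join will join them as well.

The first set of problematic cases can be resolved by wrapping the names in df2 in word boundary anchors \b. These assert that, for example, Smith must be bounded by word boundaries on both sides, which is not the case with Smithsonian: it does have an invisible boundary on its left, but its right-hand boundary comes only after its last letter n. The second set of problematic cases can be resolved by adding a negative lookahead after the \b, namely \b(?!-), which asserts that no hyphen may follow the word boundary.

The solution is easily implemented with mutate and paste0, like so: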

fuzzy_join(
  df1, df2 %>%
    mutate(lastnames = paste0("\\b", lastnames, "\\b(?!-)")),
  by  = c("ages", "fullnames" = "lastnames"),
  match_fun = list(`==`, stringr::str_detect),
  mode = "left"
) %>%
  select(fullnames, ages = ages.x, homestate)
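
To see what the anchored pattern does in isolation, here is a small check using base R's grepl() with perl = TRUE, which supports lookaheads. It is a sketch on made-up names, not on the posted data, and it assumes that stringr's regex engine treats \b and (?!-) the same way for these simple ASCII cases.

```r
# Made-up test names covering the cases discussed above.
candidates <- c("Ms. Kate Smith", "Kate Smithsonian",
                "Kate Smith-Klein", "Kate Hammer-Smith")

# Same construction as in the mutate() call: \bSmith\b(?!-)
pattern <- paste0("\\b", "Smith", "\\b(?!-)")

# perl = TRUE is needed in base R for the negative lookahead.
hits <- grepl(pattern, candidates, perl = TRUE)
print(setNames(hits, candidates))
```

Smithsonian and Smith-Klein are rejected as intended. A name such as Hammer-Smith, where the hyphen comes before Smith, still matches, so it would need an additional guard (for example a lookbehind) if such names occur in the real data.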