Joining two dataframes on a condition (grepl)

Question

I would like to join two dataframes based on a condition, in this case that one string is contained within another. Say I have two dataframes,

df1 <- data.frame(fullnames=c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"), 
                  ages = c(30, 51, 45, 38, 20))

       fullnames ages
1       Jane Doe   30
2 Mr. John Smith   51
3 Nate Cox, Esq.   45
4   Bill Lee III   38
5 Ms. Kate Smith   20

df2 <- data.frame(lastnames=c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"), 
                  ages=c(30, 45, 20, 28, 51, 38), 
                  homestate=c("NJ", "CT", "MA", "RI", "MA", "NY"))
  lastnames ages homestate
1       Doe   30        NJ
2       Cox   45        CT
3     Smith   20        MA
4      Jung   28        RI
5     Smith   51        MA
6       Lee   38        NY

I want to left join these two dataframes on ages and on the rows where df2$lastnames is contained within df1$fullnames. I thought fuzzy_join might do it, but I don't think it liked my grepl:

joined_dfs <- fuzzy_join(df1, df2, by = c("ages", "fullnames"="lastnames"),
                         match_fun = c("=", "grepl()"),
                         mode="left")
Error in which(m) : argument to 'which' is not logical

Desired result: a dataframe identical to the first but with a "homestate" column appended. Any ideas?

Answer 1

TLDR

You just need to fix match_fun:

# ...
match_fun = list(`==`, stringr::str_detect),
# ...

Background

You had the right idea, but you misinterpreted the match_fun parameter of fuzzyjoin::fuzzy_join(). Per the documentation, match_fun should be a

Vectorized function given two columns, returning TRUE or FALSE as to whether they are a match. Can be a list of functions one for each pair of columns specified in by (if a named list, it uses the names in x). If only one function is given it is used on all column pairs.

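Solution

A simple correction will do the trick, with further formatting by dplyr. For conceptual clarity, I have typographically aligned the by columns with the functions used to match them:

library(dplyr)
library(fuzzyjoin)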

# ...
# Existing code
# ...

joined_dfs <- fuzzy_join(
  df1, df2,

  by        =       c("ages", "fullnames" = "lastnames"),
  #                   |----|  |-----------------------|
  match_fun =    list(`==`  , stringr::str_detect      ),
  #                   |--|    |-----------------|
  #   Match by equality ^      ^ Match by detection of `lastnames` in `fullnames`    

  mode = "left"
) %>%
  # Format resulting dataset as you requested.
  select(fullnames, ages = ages.x, homestate)

Result

Given the sample data you reproduced here

df1 <- data.frame(
  fullnames = c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"),
  ages = c(30, 51, 45, 38, 20)
)

df2 <- data.frame(
  lastnames = c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"),
  ages = c(30, 45, 20, 28, 51, 38),
  homestate = c("NJ", "CT", "MA", "RI", "MA", "NY")
)

this solution should produce the following data.frame for joined_dfs, formatted as requested:

       fullnames ages homestate
1       Jane Doe   30        NJ
2 Mr. John Smith   51        MA
3 Nate Cox, Esq.   45        CT
4   Bill Lee III   38        NY
5 Ms. Kate Smith   20        MA

Note

Because each ages value happens to be a unique key, the following join on only the *names columns

fuzzy_join(
  df1, df2,
  by = c("fullnames" = "lastnames"),
  match_fun = stringr::str_detect,
  mode = "left"
)

will better illustrate the behavior of matching on substrings:

       fullnames ages.x lastnames ages.y homestate
1       Jane Doe     30       Doe     30        NJ
2 Mr. John Smith     51     Smith     20        MA
3 Mr. John Smith     51     Smith     51        MA
4 Nate Cox, Esq.     45       Cox     45        CT
5   Bill Lee III     38       Lee     38        NY
6 Ms. Kate Smith     20     Smith     20        MA
7 Ms. Kate Smith     20     Smith     51        MA

Where you went wrong

Erroneous type

The value passed to match_fun should be either (the symbol for) a function

fuzzyjoin::fuzzy_join(
  # ...
  match_fun = grepl
  # ...
)

or a list of such (symbols for) functions:

fuzzyjoin::fuzzy_join(
  # ...
  match_fun = list(`=`, grepl)
  # ...
)

Rather than providing a list of symbols

match_fun = list(=, grepl)

you incorrectly provided a vector of character strings:

match_fun = c("=", "grepl()")

Erroneous syntax

You should name the functions

`=`
grepl

yet you incorrectly attempted to call them:

=
grepl()

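Naming them passes the functions themselves to match_fun, as intended, whereas calling them passes their return values*. In R, an operator like = is named using backticks: `=`.

* Assuming the calls did not fail with errors. Here, they would fail.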
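
The difference between naming a function and quoting it can be checked in plain base R. The snippet below is only an illustrative sketch; the names matchers, ages_x and ages_y are made up for this example and are not part of fuzzyjoin.

```r
# A backticked operator is the function object itself...
is.function(`==`)        # TRUE
# ...whereas a quoted name is just a character string.
is.function("==")        # FALSE
is.character("grepl()")  # TRUE

# Because `==` is an ordinary vectorized function, it can be stored in a
# list and called later, which is the kind of value match_fun expects.
matchers <- list(`==`, grepl)   # hypothetical name, for illustration only
ages_x <- c(30, 51, 45)         # made-up values
ages_y <- c(30, 20, 45)
matchers[[1]](ages_x, ages_y)   # TRUE FALSE TRUE
```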

Inappropriate functions

To compare two values for equality, here the numeric vectors df1$ages and df2$ages, you should use the relational operator ==; yet you incorrectly supplied the assignment operator =.

Furthermore, grepl() is not vectorized in quite the way match_fun requires. While its second argument (x) is indeed a vector

a character vector where matches are sought, or an object which can be coerced by as.character to a character vector. Long vectors are supported.

its first argument (pattern) is (treated as) a single character string:

character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except for regexpr, gregexpr and regexec.

Thus, grepl() is not a

Vectorized function given two columns...

but rather a function given one string (scalar) and one column (vector) of strings.

The answer to your prayers is not grepl() but rather stringr::str_detect(), which is

Vectorised over string and pattern. Equivalent to grepl(pattern, x).

and wraps stringi::stri_detect().
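
To make the vectorization point concrete, here is a minimal base R sketch. pairwise_detect is a hypothetical helper written only for this illustration (it is not part of stringr or fuzzyjoin); it wraps grepl() in mapply() so that the i-th pattern is tested against the i-th string.

```r
fullnames <- c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.")
lastnames <- c("Doe", "Cox", "Cox")

# grepl() takes ONE pattern and checks it against every element of x:
grepl("Cox", fullnames)
# FALSE FALSE TRUE

# Hypothetical helper: element-wise detection, pairing string i with pattern i.
pairwise_detect <- function(string, pattern) {
  unname(mapply(function(s, p) grepl(p, s), string, pattern))
}

pairwise_detect(fullnames, lastnames)
# TRUE FALSE TRUE
```

Since stringr::str_detect() is documented as vectorised over both string and pattern, str_detect(fullnames, lastnames) should give the same element-wise result without any helper.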

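Note

Because you are simply trying to detect whether the literal strings in df1$fullnames contain the literal strings in df2$lastnames, you don't want to accidentally treat the strings in df2$lastnames as regular expression patterns. Now, your df2$lastnames column is statistically unlikely to contain names with special regex characters; the lone exception is -, which is interpreted literally outside of [], and [] are very unlikely to be found in a name.

If you are still worried about accidental regular expressions, you might want to consider alternative search methods with stringi::stri_detect_fixed() or stringi::stri_detect_coll(). These perform literal matching, respectively by either byte or "canonical equivalence"; the latter adjusts for locale and special characters, in keeping with natural language processing.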
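
As a rough illustration of the regex-versus-literal distinction, the sketch below uses base R's grepl(..., fixed = TRUE) as a stand-in for the stringi functions named above, so that it runs without extra packages. The names and the pattern are invented for the example.

```r
# Made-up data: a last name that happens to contain the regex metacharacter "."
fullnames <- c("Pat St. John", "Pat Stu John")
pattern   <- "St. John"

# Interpreted as a regular expression, "." matches any single character,
# so "Stu John" is matched as well:
grepl(pattern, fullnames)
# TRUE TRUE

# Interpreted literally, only the exact substring "St. John" matches:
grepl(pattern, fullnames, fixed = TRUE)
# TRUE FALSE
```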

Answer 2

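Given your two dataframes, this seems to work:

EDITED following @Greg's comment:

The code is adapted to the data as posted; if your actual data has more variation, particularly in the last names (for example not only III but also IV), feel free to adapt the code accordingly: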

library(dplyr)
df1 %>%
  mutate(
    # create new column that gets rid of strings after last name:
    lastnames = sub("\\sI{1,3}$|,.+$", "", fullnames),
    # grab last names:
    lastnames = sub(".*?(\\w+)$", "\\1", lastnames)) %>%
  # join the two dataframes:
  left_join(., df2, by = c("lastnames", "ages"))
       fullnames ages lastnames homestate
1       Jane Doe   30       Doe        NJ
2 Mr. John Smith   51     Smith        MA
3 Nate Cox, Esq.   45       Cox        CT
4   Bill Lee III   38       Lee        NY
5 Ms. Kate Smith   20     Smith        MA

If you want lastnames removed, just append this after %>%:

select(-lastnames)
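
To see what the two sub() calls do, the steps can be run on the sample names in base R, outside the pipeline. This is only a walk-through sketch; step1 and step2 are names introduced here for illustration.

```r
fullnames <- c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.",
               "Bill Lee III", "Ms. Kate Smith")

# Step 1: drop a trailing " I", " II" or " III", or everything from a comma on.
step1 <- sub("\\sI{1,3}$|,.+$", "", fullnames)
step1
# "Jane Doe" "Mr. John Smith" "Nate Cox" "Bill Lee" "Ms. Kate Smith"

# Step 2: keep only the final run of word characters, i.e. the last name.
step2 <- sub(".*?(\\w+)$", "\\1", step1)
step2
# "Doe" "Smith" "Cox" "Lee" "Smith"

# Caveat: a suffix not covered by the pattern survives step 1 and ends up
# being treated as the last name.
sub(".*?(\\w+)$", "\\1", sub("\\sI{1,3}$|,.+$", "", "Bill Lee IV"))
# "IV"
```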

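EDIT #2:

If you don't trust the above solution because of huge variation in how last names are actually recorded, then of course fuzzy_join is an option as well. BUT the current fuzzy_join solution is not sufficient; it needs to be amended by one critical data transformation. This is because str_detect detects whether a string is contained within another string. That is, it will return TRUE if you compare, for example, Smith to Smithsonian or to Hammer-Smith: each time, the string Smith is indeed contained in the longer name. If, as will likely be the case in a large dataset, Smith and Smithsonian happen to have the same ages, the mismatch will be perfect: fuzzy_join will incorrectly join the two. The same problem arises when, for example, Smith and Smith-Klein have the same age: fuzzy_join will join them too.

The first set of problematic cases can be addressed by including word boundary anchors \b in df2. These assert that, for example, Smith must be bounded by word boundaries on both sides, which is not the case for Smithsonian: it does have an invisible boundary to the left, but the boundary to the right falls after its final letter n. The second set of problematic cases can be addressed by including a negative lookahead after \b, namely \b(?!-), which asserts that the word boundary must not be followed by a hyphen.

The solution can easily be implemented with mutate and paste0, like so: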

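library(stringr)  # for str_detect(); dplyr and fuzzyjoin are assumed to be loaded already

fuzzy_join(
  df1, df2 %>%
    # in an R string literal the backslash itself must be escaped: "\\b"
    mutate(lastnames = paste0("\\b", lastnames, "\\b(?!-)")),
  by  = c("ages", "fullnames" = "lastnames"),
  match_fun = list(`==`, str_detect),
  mode = "left"
) %>%
  select(fullnames, ages = ages.x, homestate)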
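
The effect of the anchored pattern can be checked in isolation. The sketch below uses base R's grepl() with perl = TRUE as a stand-in for str_detect() (stringr uses a different regex engine, so this assumes both treat \b and the lookahead the same way for these inputs); the test names are invented.

```r
# Build the pattern the same way as in the mutate() call above.
pattern <- paste0("\\b", "Smith", "\\b(?!-)")

# Made-up names covering the cases discussed above.
candidates <- c("Ms. Kate Smith", "Jane Smithsonian",
                "Ann Smith-Klein", "Bob Hammer-Smith")

grepl(pattern, candidates, perl = TRUE)
# TRUE FALSE FALSE TRUE
```

Note that Hammer-Smith still matches: the hyphen before Smith counts as a word boundary, and the lookahead only rules out a hyphen after the name. If such names must be excluded too, the pattern would need a further condition on what precedes the name.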
