fuzzy_left_join with match_fun %in%

Some data:

example_df <- data.frame(
  url = c('blog/blah', 'blog/?utm_medium=foo', 'blah', 'subscription/apples', 'UK/something'),
  numbs = 1:5
)

lookup_df <- data.frame(
  string = c('blog', 'subscription', 'UK'),
  group = c('blog', 'subs', 'UK')
)


library(dplyr)      # provides the %>% pipe
library(fuzzyjoin)
data_combined <- example_df %>%
  fuzzy_left_join(lookup_df, by = c("url" = "string"),
                  match_fun = `%in%`)

data_combined
                   url numbs string group
1            blog/blah     1   <NA>  <NA>
2 blog/?utm_medium=foo     2   <NA>  <NA>
3                 blah     3   <NA>  <NA>
4  subscription/apples     4   <NA>  <NA>
5         UK/something     5   <NA>  <NA>

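I expected data_combined to have values for string and group wherever match_fun finds a match. Instead, they are all NA.

For example, the first value of string in lookup_df is 'blog'. Since I assumed this is %in% the first value of url in example_df ('blog/blah'), that row should be matched, with 'blog' and 'blog' in the string and group fields.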
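A quick check outside of any join suggests why every row comes back NA: %in% tests whether a value is exactly equal to one of the values on the right-hand side, not whether it contains one of them. The sketch below uses only base R and copies the two columns into plain vectors; the names exact and partial are just illustrative.

```r
# Same values as example_df$url and lookup_df$string above
url    <- c('blog/blah', 'blog/?utm_medium=foo', 'blah',
            'subscription/apples', 'UK/something')
string <- c('blog', 'subscription', 'UK')

# %in% asks: is each url exactly equal to one of the lookup strings?
exact <- url %in% string
print(exact)    # all FALSE, since no url is identical to a lookup string

# A substring test for one lookup value, using literal (non-regex) matching
partial <- grepl('blog', url, fixed = TRUE)
print(partial)  # TRUE for the two urls that contain 'blog'
```

So the join needs a match function that does partial matching, or the url has to be reduced to the part that is expected to match exactly.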

If we want to do a partial match of the word before the / in 'url' against the 'string' column in 'lookup_df', we can extract that substring as a new column and then do a regex_left_join:

library(dplyr)
library(fuzzyjoin)
library(stringr)
example_df %>%
    mutate(string = str_remove(url, "/.*")) %>%   # keep only the text before the first "/"
    regex_left_join(lookup_df, by = 'string') %>%
    select(url, numbs, group)

-output

#                   url numbs group
#1            blog/blah     1  blog
#2 blog/?utm_medium=foo     2  blog
#3                 blah     3  <NA>
#4  subscription/apples     4  subs
#5         UK/something     5    UK
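As a possible alternative, fuzzy_left_join can be kept and given a match function that does substring matching. The sketch below assumes stringr's str_detect is an acceptable matcher here; note that it treats each lookup string as a regular expression and matches it anywhere in url, not only before the first /, so it is looser than the approach above.

```r
library(fuzzyjoin)
library(stringr)

example_df <- data.frame(
  url = c('blog/blah', 'blog/?utm_medium=foo', 'blah',
          'subscription/apples', 'UK/something'),
  numbs = 1:5
)

lookup_df <- data.frame(
  string = c('blog', 'subscription', 'UK'),
  group = c('blog', 'subs', 'UK')
)

# str_detect(url, string) is TRUE when the lookup string occurs in url
data_combined <- fuzzy_left_join(
  example_df, lookup_df,
  by = c("url" = "string"),
  match_fun = str_detect
)

data_combined
```

With this data each url contains at most one lookup string, so the result still has one row per url. If lookup strings could overlap (or contain regex metacharacters), extracting the prefix first, as above, is the safer choice.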