fuzzy_left_join with match_fun %in%
Some data:
example_df <- data.frame(
  url = c('blog/blah', 'blog/?utm_medium=foo', 'blah', 'subscription/apples', 'UK/something'),
  numbs = 1:5
)
lookup_df <- data.frame(
  string = c('blog', 'subscription', 'UK'),
  group = c('blog', 'subs', 'UK')
)
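library(dplyr)      # loaded for the %>% pipe
library(fuzzyjoin)
# Left join that keeps every row of example_df, matching url against string with %in%
data_combined <- example_df %>%
  fuzzy_left_join(lookup_df, by = c("url" = "string"),
                  match_fun = `%in%`)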
data_combined
                   url numbs string group
1            blog/blah     1   <NA>  <NA>
2 blog/?utm_medium=foo     2   <NA>  <NA>
3                 blah     3   <NA>  <NA>
4  subscription/apples     4   <NA>  <NA>
5         UK/something     5   <NA>  <NA>
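I expected data_combined to contain the string and group values wherever match_fun finds a match. Instead, every value is NA.
For example, the first value of string in lookup_df is 'blog'. Since this is %in% the first url value of example_df ('blog/blah'), that row should be matched with the values 'blog' and 'blog' in the string and group fields.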
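A likely explanation for the all-NA result is that %in% tests whether a whole value is exactly equal to one of the candidates; it does not look for substrings. The sketch below uses only base R and a shortened copy of the data above (the variable names urls, strings, exact_hits and substring_hits are made up for illustration) to contrast the two kinds of test:

```r
# Base R only: compare exact membership with substring detection.
urls    <- c('blog/blah', 'blog/?utm_medium=foo', 'blah')
strings <- c('blog', 'subscription', 'UK')

# %in% asks: is each url exactly equal to one of the lookup strings?
exact_hits <- urls %in% strings

# grepl() with fixed = TRUE asks: does each url contain the literal text 'blog'?
substring_hits <- grepl('blog', urls, fixed = TRUE)

print(exact_hits)      # FALSE FALSE FALSE
print(substring_hits)  # TRUE TRUE FALSE
```

No url is identical to a lookup string, so a join driven by %in% has nothing to match on.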
If we want a partial match between the word before the / in 'url' and the 'string' column of 'lookup_df', we can extract that substring into a new column and then do a regex_left_join:
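library(dplyr)
library(fuzzyjoin)
library(stringr)
example_df %>%
  # keep only the text before the first "/" (a bare "/" needs no escaping in R)
  mutate(string = str_remove(url, "/.*")) %>%
  # treat each lookup_df$string as a regular expression
  regex_left_join(lookup_df, by = 'string') %>%
  select(url, numbs, group)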
-output
#                    url numbs group
#1             blog/blah     1  blog
#2  blog/?utm_medium=foo     2  blog
#3                  blah     3  <NA>
#4   subscription/apples     4  subs
#5          UK/something     5    UK
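As a possible alternative that stays closer to the original attempt, fuzzy_left_join can be given a substring-detecting match_fun instead of %in%. The sketch below assumes stringr::str_detect is an acceptable match function here: it is vectorised over both arguments and returns one logical per pair. It is not anchored to the start of the url, so a lookup string appearing anywhere in the url would count as a match.

```r
library(dplyr)
library(fuzzyjoin)
library(stringr)

example_df <- data.frame(
  url = c('blog/blah', 'blog/?utm_medium=foo', 'blah', 'subscription/apples', 'UK/something'),
  numbs = 1:5
)
lookup_df <- data.frame(
  string = c('blog', 'subscription', 'UK'),
  group = c('blog', 'subs', 'UK')
)

# str_detect(url, string) is TRUE when the url contains the lookup string,
# which is the partial match that %in% cannot provide.
data_combined <- example_df %>%
  fuzzy_left_join(lookup_df, by = c("url" = "string"),
                  match_fun = str_detect)

data_combined
```

With this data, the 'blah' row should be the only one left unmatched. If lookup strings can appear later in a url, the extract-then-join approach above is the safer choice.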