Make a fuzzyjoin and keep only the exact match when there is one, keep all options otherwise
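I have two data frames that I am trying to join on a country name field. What I would like to achieve is this: when a perfect match is found, I want to keep only that row; otherwise I would like to show all rows/options.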
library(fuzzyjoin)
df1 <- data.frame(
  country = c('Germany','Germany and Spain','Italy','Norway and Sweden','Austria','Spain'),
  score = c(7,8,9,10,11,12)
)
df2 <- data.frame(
  country_name = c('Germany and Spain','Germany','Germany.','Germania','Deutschland','Germany - ','Spun','Spain and Portugal','Italy','Italia','Greece and Italy',
                   'Australia','Austria...','Norway (Scandinavia)','Norway','Sweden'),
  comments = c('xxx','rrr','ttt','hhhh','gggg','jjjj','uuuuu','ooooo','yyyyyyyyyy','bbbbb','llllll','wwwwwww','nnnnnnn','cc','mmmm','lllll')
)
j <- regex_left_join(df1, df2, by = c('country' = 'country_name'), ignore_case = TRUE)
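The result (j) shows 'Germany and Spain' three times. The first occurrence is the perfect match, so I would like to keep only that one and drop the other two. 'Norway and Sweden' has no perfect match, so I would like to keep both possible options/rows (as they are).
How can I do this?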
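You can use stringdist::stringdist to compute the distance between each pair of matched strings and then, for entries where an exact match exists, keep only that match: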
library(dplyr)
j %>%
  mutate(dist = stringdist::stringdist(country, country_name)) %>% # add distance
  group_by(country) %>% # group entries
  mutate(exact = any(dist == 0)) %>% # check if exact match exists in group
  filter(!exact | dist == 0) %>% # keep only entries where no exact match exists in the group OR where the entry is the exact match
  ungroup()
#> # A tibble: 5 x 6
#>   country           score country_name      comments    dist exact
#>   <chr>             <dbl> <chr>             <chr>      <dbl> <lgl>
#> 1 Germany               7 Germany           rrr            0 TRUE
#> 2 Germany and Spain     8 Germany and Spain xxx            0 TRUE
#> 3 Italy                 9 Italy             yyyyyyyyyy     0 TRUE
#> 4 Norway and Sweden    10 Norway            mmmm          11 FALSE
#> 5 Norway and Sweden    10 Sweden            lllll         11 FALSE
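Note that 'Austria' and 'Spain' are missing from the output above: they have no match at all, so their dist is NA and filter() drops them. If those unmatched rows should be kept, a possible variant (a sketch, not part of the original answer) compares the two columns directly with == and treats NA as "not exact". The name res is only illustrative, and stringsAsFactors = FALSE is added as a precaution for older R versions.

```r
library(fuzzyjoin)
library(dplyr)

df1 <- data.frame(
  country = c('Germany','Germany and Spain','Italy','Norway and Sweden','Austria','Spain'),
  score = c(7,8,9,10,11,12),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  country_name = c('Germany and Spain','Germany','Germany.','Germania','Deutschland','Germany - ','Spun','Spain and Portugal','Italy','Italia','Greece and Italy',
                   'Australia','Austria...','Norway (Scandinavia)','Norway','Sweden'),
  comments = c('xxx','rrr','ttt','hhhh','gggg','jjjj','uuuuu','ooooo','yyyyyyyyyy','bbbbb','llllll','wwwwwww','nnnnnnn','cc','mmmm','lllll'),
  stringsAsFactors = FALSE
)

j <- regex_left_join(df1, df2, by = c('country' = 'country_name'), ignore_case = TRUE)

res <- j %>%
  # TRUE only for a non-missing, character-for-character match
  mutate(exact = !is.na(country_name) & country == country_name) %>%
  group_by(country) %>%
  # keep the exact row if the group has one, otherwise keep the whole group
  filter(exact | !any(exact)) %>%
  ungroup()

res
```

With this version the unmatched countries stay in the result with NA in country_name and comments. The comparison is case-sensitive even though the join uses ignore_case = TRUE; if 'germany' should count as an exact match for 'Germany', comparing tolower(country) with tolower(country_name) would be one way to do it.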