Merging two data frames based on the maximum number of words in common in R

I have two data.frames, one containing partial names and the other containing full names, as follows:

partial <- data.frame("partial.name" = c("Apple", "Apple", "WWF",
                                         "wizz air", "WeMove.eu", "ILU"))
full <- data.frame("full.name" = c("Apple Inc", "wizzair", "We Move Europe",
                                   "World Wide Fundation (WWF)", "(ILU)", "Ilusion"))

In an ideal world, I would love to have a table like this (my real partial df has 12794 rows):

print(partial)
partial     full
Apple       Apple Inc
Apple       Apple Inc
WWF         World Wide Fundation (WWF)
wizz air    wizzair
WeMove.eu   We Move Europe
... 12794 rows in total

For every row without a match, I would like the result to be NA.

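I have tried many things: fuzzyjoin, regex, regex_left_join, and even the sqldf package. I got some results, but I know it would work better if regex_left_join understood that I am looking for whole words. I know that boundary(type = c("word")) exists in stringr, but I don't know how to apply it here.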

For now, I have only prepared the partial df by stripping non-alphanumeric characters and converting it to lowercase.

library(stringr)

# Replace runs of non-word characters with a space, then collapse extra whitespace
partial$regex <- str_squish(str_replace_all(partial$partial.name, regex("\\W+"), " "))
partial$regex <- tolower(partial$regex)

How can I match partial$partial.name to full$full.name based on the maximum number of words in common?

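Getting partial string matching right is time-consuming. I believe Jaro-Winkler distance is a good candidate, but you will need to spend time tuning the parameters. Here is an example to get you going.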

library(stringdist)

partial <- data.frame("partial.name" = c("Apple", "Apple", "WWF", "wizz air",
                                         "WeMove.eu", "ILU", "None"),
                      stringsAsFactors = FALSE)
full <- data.frame("full.name" = c("Apple Inc", "wizzair", "We Move Europe",
                                   "World Wide Foundation (WWF)", "(ILU)", "Ilusion"),
                   stringsAsFactors = FALSE)

mydist <- function(partial, list_of_fulls, method = "jw", p = 0, threshold = 0.4) {
    # stringdist() recycles `a` over `b`, so one call scores every candidate.
    # With method = "jw", p = 0 gives the plain Jaro distance; a p in (0, 0.25]
    # adds the Winkler prefix bonus.
    distances <- stringdist(a = partial, b = list_of_fulls, method = method, p = p)
    # If even the closest candidate is too far away, assume there is no match
    if (min(distances) > threshold) {
        NA
    } else {
        list_of_fulls[which.min(distances)]
    }
}

partial$match <- unlist(lapply(partial$partial.name, function(partial) mydist(partial = partial, list_of_fulls = full$full.name, method = 'jw')))

partial
#  partial.name                       match
#1        Apple                   Apple Inc
#2        Apple                   Apple Inc
#3          WWF World Wide Foundation (WWF)
#4     wizz air                     wizzair
#5    WeMove.eu              We Move Europe
#6          ILU                       (ILU)
#7         None                        <NA>
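
If you want to stay closer to the literal criterion in the question (the largest number of words in common), here is a rough base R sketch. The helper names tokenize and best_word_match are made up for illustration and are not part of any package, and I am assuming that a "word" is a lowercase run of alphanumeric characters. Note that this criterion is stricter than a string distance: "wizz air" shares no whole word with "wizzair", so it comes back as NA.

```r
# Hypothetical helper: split a name into lowercase alphanumeric words
tokenize <- function(x) {
    words <- strsplit(tolower(gsub("[^[:alnum:]]+", " ", x)), " ", fixed = TRUE)[[1]]
    words[nzchar(words)]  # drop empty strings left by leading separators
}

# Hypothetical helper: return the candidate sharing the most words with `name`,
# or NA when no candidate shares a single word
best_word_match <- function(name, candidates) {
    name_words <- tokenize(name)
    shared <- vapply(candidates,
                     function(cand) length(intersect(name_words, tokenize(cand))),
                     integer(1))
    if (max(shared) == 0L) NA_character_ else candidates[which.max(shared)]
}

partial_names <- c("Apple", "WWF", "wizz air", "ILU")
full_names <- c("Apple Inc", "wizzair", "We Move Europe",
                "World Wide Foundation (WWF)", "(ILU)", "Ilusion")

word_matches <- vapply(partial_names, best_word_match, character(1),
                       candidates = full_names, USE.NAMES = FALSE)
```

Ties go to the first candidate, because which.max() returns the first maximum. One option is to use the word overlap first and fall back to the string distance above for the rows that come back as NA.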