Finding partial matches on strings in R

I have a very large database of names that look like this:

names <- c("William Gates", "Bill Gates", "Gates, William H. III", 
    "Gates, William III", "William H Gates", "William H. Gates", 
    "Carlos Slim Helu & family", "Carlos Slim Helu", 
    "Carlos Slim & Family", "Carlos Slim")

I would like to 'clean up' these names automatically, like this:

new_names <- c("William Gates", "William Gates", "William Gates", 
    "William Gates", "William Gates", "William Gates", 
    "Carlos Slim Helu & family", "Carlos Slim Helu & family", 
    "Carlos Slim Helu & family", "Carlos Slim Helu & family")

I am (arbitrarily) using the first occurrence of a name to replace its other variants.

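In this example, names is a character vector of length 10. I would like to create a 10 x 10 matrix of "partial match values". The matrix would store "measures" of partial matching that range between 0 and 1. For example, comparing names[1] with names[1] gives a perfect match, so the value is 1; comparing names[1] with names[2] gives something like 5/12 = 0.41667, reflecting that Gates is common to both strings and that (ignoring the space) names[1] has 12 letters; by the same logic, comparing names[2] with names[1] gives something like 5/9 = 0.55556.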
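The measure described above can be sketched in base R. This is only one possible reading of the description: the helpers `tokenize` and `partial_score` are hypothetical names made up for illustration, and the sketch assumes that a "match" means a whole word shared by both strings, with case ignored and punctuation dropped. The data vector is called `raw_names` here so that it does not shadow the names() function.

```r
raw_names <- c("William Gates", "Bill Gates", "Gates, William H. III",
               "Gates, William III", "William H Gates", "William H. Gates",
               "Carlos Slim Helu & family", "Carlos Slim Helu",
               "Carlos Slim & Family", "Carlos Slim")

# Hypothetical helper: split a name into lower-case words, dropping punctuation.
tokenize <- function(x) {
  x <- tolower(gsub("[[:punct:]]", " ", x))
  words <- strsplit(x, "[[:space:]]+")[[1]]
  words[nzchar(words)]
}

# Hypothetical helper: share of the letters of `a` that sit in words
# which also occur in `b` (assumption: whole-word matches only).
partial_score <- function(a, b) {
  wa <- tokenize(a)
  wb <- tokenize(b)
  sum(nchar(wa[wa %in% wb])) / sum(nchar(wa))
}

# 10 x 10 matrix; row i, column j compares raw_names[i] with raw_names[j].
idx <- seq_along(raw_names)
score <- outer(idx, idx, Vectorize(function(i, j) {
  partial_score(raw_names[i], raw_names[j])
}))
dimnames(score) <- list(raw_names, raw_names)

round(score[1:2, 1:2], 5)
```

The matrix is not symmetric, because each row is scaled by the length of its own string: `score[1, 2]` is 5/12 while `score[2, 1]` is 5/9.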

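I would probably ignore case (family and Family are a perfect match) and focus only on matching substrings (but if anyone has comments on how to match, say, Slim and Silm, that would be great too).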
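For the two side questions (case and small typos such as Slim versus Silm), base R's adist() gives a quick feel for the numbers. This is only an illustration of how edit distance behaves on these two pairs, not a recommendation for a particular cutoff; the variable names are arbitrary.

```r
# Plain Levenshtein distance: two swapped letters count as two substitutions.
d_swap <- adist("Slim", "Silm")[1, 1]

# A case difference costs one edit by default ...
d_case <- adist("family", "Family")[1, 1]

# ... and nothing once case is ignored.
d_nocase <- adist("family", "Family", ignore.case = TRUE)[1, 1]

c(swap = d_swap, case = d_case, nocase = d_nocase)
```

A pure substring or whole-word comparison would treat Slim and Silm as completely different, whereas an edit distance puts them only two edits apart.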

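As a second step, I would create a triangular matrix of the maximum values (in the example, the value 5/9 = 0.55556). I would then use this matrix to look at the situation and select a threshold, say 0.95, above which strings would be replaced, gradually lowering the threshold until I am satisfied that the data are clean.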
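A minimal sketch of this second step, using a small hand-written 3 x 3 score matrix so that it runs on its own. The third name, the zero scores, and the threshold of 0.5 are assumptions chosen only to make the toy example produce a replacement; the replacement rule (take the earliest name whose score exceeds the threshold) is one way to read "use the first occurrence".

```r
nm <- c("William Gates", "Bill Gates", "Carlos Slim")

# Toy asymmetric scores: row i, column j compares nm[i] with nm[j].
score <- matrix(c(1,     5 / 12, 0,
                  5 / 9, 1,      0,
                  0,     0,      1),
                nrow = 3, byrow = TRUE, dimnames = list(nm, nm))

# Keep the larger of the two directions, then blank the upper triangle.
tri <- pmax(score, t(score))
tri[upper.tri(tri)] <- NA

threshold <- 0.5  # assumed value for the toy data
cleaned <- nm
for (i in seq_along(nm)) {
  for (j in seq_len(i - 1)) {
    if (tri[i, j] > threshold) {
      cleaned[i] <- cleaned[j]  # reuse the earlier spelling
      break
    }
  }
}
cleaned
```

Lowering `threshold` step by step and re-running the loop mirrors the gradual procedure described above.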

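I hope this kind of thing has been done before and that someone can help me get started. I have read about Paul Murrell's compare package and expected it to be a good tool for this, but I have not seen many examples that are easy to adapt, so if you know of a tutorial, another example, or a package vignette, please point me to it.

I do realise that a good question needs more code, and I apologise for not being able to provide much. Although I am fairly comfortable with R, I am not familiar with string matching. If someone points me to somewhere to start, I can try to rephrase my question with some example code.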

Here is a simple attempt. It uses only built-in functions and does not create any matrix, but it seems to work for this simple example.

names <- c("William Gates", "Bill Gates", "Gates, William H. III", 
           "Gates, William III", "William H Gates", "William H. Gates", 
           "Carlos Slim Helu & family", "Carlos Slim Helu", 
           "Carlos Slim & Family", "Carlos Slim")

new_names <- c("William Gates", "William Gates", "William Gates", 
               "William Gates", "William Gates", "William Gates", 
               "Carlos Slim Helu & family", "Carlos Slim Helu & family", 
               "Carlos Slim Helu & family", "Carlos Slim Helu & family")

nn <- c('Bill Gates','Carlos Slim')


cbind(names, sapply(nn, function(x) 
  ifelse(agrepl(x, names, max.distance = 5), x, NA)))

#       names                       Bill Gates   Carlos Slim
#  [1,] "William Gates"             "Bill Gates" NA
#  [2,] "Bill Gates"                "Bill Gates" NA
#  [3,] "Gates, William H. III"     "Bill Gates" NA
#  [4,] "Gates, William III"        "Bill Gates" NA
#  [5,] "William H Gates"           "Bill Gates" NA
#  [6,] "William H. Gates"          "Bill Gates" NA
#  [7,] "Carlos Slim Helu & family" NA           "Carlos Slim"
#  [8,] "Carlos Slim Helu"          NA           "Carlos Slim"
#  [9,] "Carlos Slim & Family"      NA           "Carlos Slim"
# [10,] "Carlos Slim"               NA           "Carlos Slim"

Edit

names <- c("William Gates", "Bill Gates", "Gates, William H. III", 
           "Gates, William III", "William H Gates", "William H. Gates", 
           "Carlos Slim Helu & family", "Carlos Slim Helu", 
           "Carlos Slim & Family", "Carlos Slim")

names <- gsub('[[:punct:]]', '', names)
nn <- sort(table(unlist(strsplit(names, ' '))))
nn <- names(nn[nn >= 4])

cbind(names, sapply(nn, function(x) 
  ifelse(agrepl(x, names, max.distance = 1), x, NA)))

#       names                      Carlos   Slim   William   Gates
#  [1,] "William Gates"            NA       NA     "William" "Gates"
#  [2,] "Bill Gates"               NA       NA     NA        "Gates"
#  [3,] "Gates William H III"      NA       NA     "William" "Gates"
#  [4,] "Gates William III"        NA       NA     "William" "Gates"
#  [5,] "William H Gates"          NA       NA     "William" "Gates"
#  [6,] "William H Gates"          NA       NA     "William" "Gates"
#  [7,] "Carlos Slim Helu  family" "Carlos" "Slim" NA        NA
#  [8,] "Carlos Slim Helu"         "Carlos" "Slim" NA        NA
#  [9,] "Carlos Slim  Family"      "Carlos" "Slim" NA        NA
# [10,] "Carlos Slim"              "Carlos" "Slim" NA        NA

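The stringdist package might help with getting the matrix; it is also described in the June 2014 issue of the R Journal. Update: one of the q-gram methods will probably work best for names that appear both as "last name, first name" and as "first name last name".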

library(stringdist)
stringdistmatrix(names, names, "jaccard")
        [,1]  [,2]  [,3]  [,4]   [,5]   [,6]  [,7]  [,8]  [,9] [,10]
 [1,] 0.0000 0.273 0.286 0.167 0.0909 0.1667 0.632 0.562 0.647 0.571
 [2,] 0.2727 0.000 0.467 0.385 0.3333 0.3846 0.684 0.625 0.706 0.643
 [3,] 0.2857 0.467 0.000 0.143 0.2143 0.1429 0.636 0.579 0.714 0.667
 [4,] 0.1667 0.385 0.143 0.000 0.2308 0.2857 0.667 0.611 0.684 0.625
 [5,] 0.0909 0.333 0.214 0.231 0.0000 0.0833 0.579 0.500 0.667 0.600
 ...
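To show why a q-gram measure copes with swapped word order, here is a hand-rolled Jaccard distance on character bigrams in base R. It is a sketch for intuition only and is not the stringdist implementation; `qgrams` and `jaccard_dist` are hypothetical helper names, and lower-casing the input is an added assumption.

```r
# Hypothetical helper: the set of distinct q-grams in a lower-cased string.
qgrams <- function(x, q = 2) {
  x <- tolower(x)
  n <- nchar(x)
  if (n < q) return(character(0))
  unique(substring(x, 1:(n - q + 1), q:n))
}

# Hypothetical helper: 1 - |intersection| / |union| of the two q-gram sets.
jaccard_dist <- function(a, b, q = 2) {
  ga <- qgrams(a, q)
  gb <- qgrams(b, q)
  all_grams <- union(ga, gb)
  if (length(all_grams) == 0) return(0)  # both strings shorter than q
  1 - length(intersect(ga, gb)) / length(all_grams)
}

d_same    <- jaccard_dist("William Gates", "William Gates")
d_swapped <- jaccard_dist("William Gates", "Gates William")
d_other   <- jaccard_dist("William Gates", "Carlos Slim")
c(same = d_same, swapped = d_swapped, other = d_other)
```

Because the q-gram sets ignore position, reordering the words changes only the few bigrams that span the word boundary, so the swapped pair stays much closer than an unrelated name.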

A complete answer based on adist and clustering.

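With the arguments partial=TRUE and ignore.case=TRUE, the function adist from base R seems to do the trick. In the long haul, the stringdist library pointed out by Chris S looks promising, but this approach works as well.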
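As a small illustration of what partial=TRUE changes, the snippet below compares one short and one long string in both directions. The strings are picked from the example data; the variable names are arbitrary.

```r
# With partial = TRUE the first argument is matched against substrings
# of the second, so a name contained in a longer one has distance 0.
d_short_in_long <- adist("Gates", "William Gates", partial = TRUE)[1, 1]

# The reverse direction is not free: the longer string cannot be found
# inside the shorter one without edits.
d_long_in_short <- adist("William Gates", "Gates", partial = TRUE)[1, 1]

# Without partial matching the distance is the same in both directions.
d_full <- adist("Gates", "William Gates")[1, 1]

c(short_in_long = d_short_in_long,
  long_in_short = d_long_in_short,
  full          = d_full)
```

The partial distance matrix is therefore generally asymmetric, and as.dist() keeps only its lower triangle when the matrix is passed on to hclust().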

This solution uses clustering via hclust with 'single linkage', whose 'friends of friends' approach suits this problem.

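Note that this requires choosing a threshold in terms of cluster height (in this case, the cumulative generalized Levenshtein distance between names, as seen through the single-link criterion). If the clustering is not too expensive for your problem, then visualizing or inspecting the output of hclust should work well.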

## renamed to avoid overwriting names() function
raw_names <- c("William Gates", "Bill Gates", "Gates, William H. III",
               "Gates, William III", "William H Gates", "William H. Gates",
               "Carlos Slim Helu & family", "Carlos Slim Helu",
               "Carlos Slim & Family", "Carlos Slim")

lev_dist <- adist(raw_names, raw_names, partial=TRUE, ignore.case=TRUE)

## use single linkage method as it suits the problem
hc <- hclust(as.dist(lev_dist), method='single')

## cluster vis for picking threshold
plot(hc, labels=raw_names)
threshold <- 6 ## in terms of cluster height

## based on threshold, get clusters and make labels
cluster <- cutree(hc, h=threshold)
cluster_labels <- sapply(unique(cluster), function(i) raw_names[min(which(cluster == i))])
(new_names <- cluster_labels[cluster])

##  [1] "William Gates"             "William Gates"
##  [3] "William Gates"             "William Gates"
##  [5] "William Gates"             "William Gates"
##  [7] "Carlos Slim Helu & family" "Carlos Slim Helu & family"
##  [9] "Carlos Slim Helu & family" "Carlos Slim Helu & family"
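If a dendrogram is awkward to read for a larger data set, the merge heights can also be inspected numerically. The sketch below rebuilds the same clustering and counts the clusters at a few candidate cut heights; the candidate values in `thresholds` are arbitrary choices for illustration.

```r
raw_names <- c("William Gates", "Bill Gates", "Gates, William H. III",
               "Gates, William III", "William H Gates", "William H. Gates",
               "Carlos Slim Helu & family", "Carlos Slim Helu",
               "Carlos Slim & Family", "Carlos Slim")

lev_dist <- adist(raw_names, raw_names, partial = TRUE, ignore.case = TRUE)
hc <- hclust(as.dist(lev_dist), method = "single")

# Heights at which clusters were merged; a large jump between two
# consecutive heights suggests a natural place to cut.
sort(hc$height)

# Number of clusters obtained at a few candidate cut heights.
thresholds <- c(0, 2, 4, 6, 8)
n_clusters <- sapply(thresholds, function(h) length(unique(cutree(hc, h = h))))
data.frame(threshold = thresholds, n_clusters = n_clusters)
```

Raising the cut height can only merge clusters, so the counts never increase down the table; the threshold to keep is the one where the remaining groups still look like the same person.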