Finding partial matches on strings in R
I have a very large database with names like the following:
names <- c("William Gates", "Bill Gates", "Gates, William H. III",
"Gates, William III", "William H Gates", "William H. Gates",
"Carlos Slim Helu & family", "Carlos Slim Helu",
"Carlos Slim & Family", "Carlos Slim")
I would like to 'clean up' the names automatically, like so:
new_names <- c("William Gates", "William Gates", "William Gates",
"William Gates", "William Gates", "William Gates",
"Carlos Slim Helu & family", "Carlos Slim Helu & family",
"Carlos Slim Helu & family", "Carlos Slim Helu & family")
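I am (arbitrarily) using the first occurrence of a name to replace its other variants.
In this example, names is a character vector of length 10. I would like to create a 10 x 10 matrix of "partial match values". The matrix would store measures of partial matching ranging between 0 and 1. For example, comparing names[1] with names[1] gives a perfect match, so the value is 1; comparing names[1] with names[2] would give something like 5/12 = 0.41667, reflecting that Gates is common to both strings and that (ignoring spaces) names[1] has 12 letters; by the same logic, comparing names[2] with names[1] would give something like 5/9 = 0.55556.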
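I would probably ignore case (family and Family are a perfect match) and focus only on matching substrings (but if someone has a comment on how to match, say, Slim and Silm, that would be good too).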
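As a second step, I would create a triangular matrix of the maxima (in the example, the value 5/9 = 0.55556). I would then use this matrix to eyeball the situation and select a threshold, say 0.95, above which strings would be replaced, gradually lowering the threshold until I am satisfied that the data is cleaned up.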
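One way to read the measure described above is as the share of a name's letters that sit in words also found in the other name. Below is a minimal sketch under that assumption. `tokenize` and `partial_score` are hypothetical helper names, not part of any package, and the vector is called `raw_names` to avoid masking the base function `names()`.

```r
raw_names <- c("William Gates", "Bill Gates", "Gates, William H. III",
               "Carlos Slim & Family", "Carlos Slim")

# Hypothetical helper: lower-case word tokens with punctuation removed.
tokenize <- function(x) {
  x <- tolower(gsub("[[:punct:]]", " ", x))
  tokens <- strsplit(x, "[[:space:]]+")[[1]]
  tokens[nzchar(tokens)]
}

# Hypothetical helper: fraction of the letters of `a` that belong to
# words also present in `b` (asymmetric by construction).
partial_score <- function(a, b) {
  ta <- tokenize(a)
  tb <- tokenize(b)
  sum(nchar(ta[ta %in% tb])) / sum(nchar(ta))
}

n <- length(raw_names)
score <- matrix(NA_real_, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    score[i, j] <- partial_score(raw_names[i], raw_names[j])
  }
}

# Second step: keep the larger of the two directions for every pair.
sym_score <- pmax(score, t(score))

score[1, 2]      # 5/12
score[2, 1]      # 5/9
sym_score[1, 2]  # 5/9
```

Because this works on whole words, it would not catch Slim versus Silm; that case needs an edit-distance measure instead.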
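I hope this sort of thing has been done before and that someone can help me get started. I have read about Paul Murrell's compare package and expected it to be a good tool for this, but I have not seen many examples that can be easily adapted, so if you know of a tutorial or other examples beyond the package vignette, please point me to them.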
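I do realize that a good question needs more code, and I apologize for not being able to provide much. While I am fairly familiar with R, I am not familiar with string matching. If someone points me to a place to start, I can try to rephrase my question with some sample code.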
Here is a simple attempt. It uses only built-in functions and does not create any matrix, but it seems to work for this simple example.
names <- c("William Gates", "Bill Gates", "Gates, William H. III",
"Gates, William III", "William H Gates", "William H. Gates",
"Carlos Slim Helu & family", "Carlos Slim Helu",
"Carlos Slim & Family", "Carlos Slim")
new_names <- c("William Gates", "William Gates", "William Gates",
"William Gates", "William Gates", "William Gates",
"Carlos Slim Helu & family", "Carlos Slim Helu & family",
"Carlos Slim Helu & family", "Carlos Slim Helu & family")
nn <- c('Bill Gates','Carlos Slim')
cbind(names, sapply(nn, function(x)
ifelse(agrepl(x, names, max.distance = 5), x, NA)))
# names Bill Gates Carlos Slim
# [1,] "William Gates" "Bill Gates" NA
# [2,] "Bill Gates" "Bill Gates" NA
# [3,] "Gates, William H. III" "Bill Gates" NA
# [4,] "Gates, William III" "Bill Gates" NA
# [5,] "William H Gates" "Bill Gates" NA
# [6,] "William H. Gates" "Bill Gates" NA
# [7,] "Carlos Slim Helu & family" NA "Carlos Slim"
# [8,] "Carlos Slim Helu" NA "Carlos Slim"
# [9,] "Carlos Slim & Family" NA "Carlos Slim"
# [10,] "Carlos Slim" NA "Carlos Slim"
Edit
names <- c("William Gates", "Bill Gates", "Gates, William H. III",
"Gates, William III", "William H Gates", "William H. Gates",
"Carlos Slim Helu & family", "Carlos Slim Helu",
"Carlos Slim & Family", "Carlos Slim")
names <- gsub('[[:punct:]]', '', names)
nn <- sort(table(unlist(strsplit(names, ' '))))
nn <- names(nn[nn >= 4])
cbind(names, sapply(nn, function(x)
ifelse(agrepl(x, names, max.distance = 1), x, NA)))
# names Carlos Slim William Gates
# [1,] "William Gates" NA NA "William" "Gates"
# [2,] "Bill Gates" NA NA NA "Gates"
# [3,] "Gates William H III" NA NA "William" "Gates"
# [4,] "Gates William III" NA NA "William" "Gates"
# [5,] "William H Gates" NA NA "William" "Gates"
# [6,] "William H Gates" NA NA "William" "Gates"
# [7,] "Carlos Slim Helu family" "Carlos" "Slim" NA NA
# [8,] "Carlos Slim Helu" "Carlos" "Slim" NA NA
# [9,] "Carlos Slim Family" "Carlos" "Slim" NA NA
# [10,] "Carlos Slim" "Carlos" "Slim" NA NA
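To get from a match table like the ones above to the cleaned vector the question asks for, each group of hits can be overwritten with its first occurrence. The sketch below is only an illustration: it deliberately swaps `agrepl` for exact, case-insensitive `grepl` so the result is easy to check by eye, and `keys` is a hypothetical, hand-picked list of keywords.

```r
raw_names <- c("William Gates", "Bill Gates", "Gates, William H. III",
               "Gates, William III", "William H Gates", "William H. Gates",
               "Carlos Slim Helu & family", "Carlos Slim Helu",
               "Carlos Slim & Family", "Carlos Slim")

# Hypothetical keywords, chosen by hand for this example.
keys <- c("Gates", "Slim")

cleaned <- raw_names
for (key in keys) {
  # Exact substring match, ignoring case (not fuzzy like agrepl).
  hits <- grepl(key, raw_names, ignore.case = TRUE)
  if (any(hits)) {
    # Replace every hit with the first name in which the key occurs.
    cleaned[hits] <- raw_names[which(hits)[1]]
  }
}
cleaned
```

Replacing `grepl` with `agrepl(key, raw_names, max.distance = ...)` should make the same loop tolerant of typos, at the cost of having to tune `max.distance`.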
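The stringdist package may help with getting the matrix; it was also described in the June 2014 issue of the R Journal. Update: one of the q-gram methods may work best for Lastname, Firstname versus Firstname Lastname.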
library(stringdist)
stringdistmatrix(names, names, "jaccard")
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0.0000 0.273 0.286 0.167 0.0909 0.1667 0.632 0.562 0.647 0.571
[2,] 0.2727 0.000 0.467 0.385 0.3333 0.3846 0.684 0.625 0.706 0.643
[3,] 0.2857 0.467 0.000 0.143 0.2143 0.1429 0.636 0.579 0.714 0.667
[4,] 0.1667 0.385 0.143 0.000 0.2308 0.2857 0.667 0.611 0.684 0.625
[5,] 0.0909 0.333 0.214 0.231 0.0000 0.0833 0.579 0.500 0.667 0.600
...
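For intuition, the 0.273 in row 1, column 2 can be reproduced by hand in base R, assuming the call above works on 1-grams (as far as I know `q = 1` is the stringdist default, but check your version). `jaccard_q1` is a hypothetical helper, not a stringdist function.

```r
# Jaccard distance between the sets of distinct characters (1-grams).
jaccard_q1 <- function(a, b) {
  A <- unique(strsplit(a, "")[[1]])
  B <- unique(strsplit(b, "")[[1]])
  1 - length(intersect(A, B)) / length(union(A, B))
}

d_variant <- jaccard_q1("William Gates", "Bill Gates")     # 3/11, about 0.273
d_swapped <- jaccard_q1("William Gates", "Gates William")  # 0: same character set
```

The second value shows why a q-gram measure is insensitive to word order, which is the appeal for "Lastname, Firstname" data. The flip side is that unrelated names that happen to share many letters can also look close.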
A complete answer based on adist and clustering.
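With the arguments partial=TRUE and ignore.case=TRUE, the function adist from base R seems to work for this problem. For the long haul, the stringdist library pointed out by Chris S looks promising, but this approach can be used as well.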
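A few one-off calls may help show what these arguments do. This is just a sketch; the variable names are made up for illustration.

```r
# Plain generalized Levenshtein distance: swapping two adjacent letters
# costs two substitutions.
d_swap <- adist("Slim", "Silm")[1, 1]                               # 2

# ignore.case = TRUE makes family and Family identical.
d_case <- adist("family", "Family", ignore.case = TRUE)[1, 1]       # 0

# partial = TRUE only requires the first string to match a substring
# of the second, so the result depends on the argument order.
d_partial <- adist("Gates", "William Gates", partial = TRUE)[1, 1]  # 0
d_reverse <- adist("William Gates", "Gates", partial = TRUE)[1, 1]
```

Since the partial distance is not symmetric, `as.dist()` will keep only the lower triangle of the full matrix. That is worth keeping in mind when interpreting cluster heights.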
This solution does the clustering with hclust, using the 'single linkage' method, whose 'friends of friends' approach suits this problem.
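Note that this requires picking a threshold in terms of cluster height (in this case, the cumulative generalized Levenshtein distance between the names, as seen through the single-link criterion). If clustering is not too expensive for your problem, then visualizing or inspecting the output of hclust should work well too.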
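To see what 'friends of friends' means in practice, here is a toy sketch with a made-up distance matrix (the labels A, B, C and the distances are invented for illustration): A is close to B, B is close to C, but A and C are far apart.

```r
d <- matrix(c(0, 1, 5,
              1, 0, 1,
              5, 1, 0), nrow = 3,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

hc_single   <- hclust(as.dist(d), method = "single")
hc_complete <- hclust(as.dist(d), method = "complete")

# Cut both trees at the same height.
groups_single   <- cutree(hc_single, h = 2)    # A, B, C end up together
groups_complete <- cutree(hc_complete, h = 2)  # one item stays separate
```

Single linkage chains A and C together through their common neighbour B. That is useful when name variants drift gradually, but it also means a threshold that is too high can merge unrelated names through a chain of near-matches.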
## renamed to avoid overwriting names() function
raw_names <- c("William Gates", "Bill Gates", "Gates, William H. III",
"Gates, William III", "William H Gates", "William H. Gates",
"Carlos Slim Helu & family", "Carlos Slim Helu",
"Carlos Slim & Family", "Carlos Slim")
lev_dist <- adist(raw_names, raw_names, partial=TRUE, ignore.case=TRUE)
#use single linkage method as it suits the problem
hc <- hclust(as.dist(lev_dist), method='single')
## cluster vis for picking threshold
plot(hc, labels=raw_names)
threshold <- 6 ## in terms of cluster height --
## based on threshold, get clusters and make labels
cluster <- cutree(hc, h=threshold)
cluster_labels <- sapply(unique(cluster), function(i) raw_names[min(which(cluster == i))])
(new_names <- cluster_labels[cluster])
## [1] "William Gates" "William Gates" "William Gates"
## "Carlos Slim Helu & family" "Carlos Slim Helu & family"
## [6] "William Gates" "William Gates" "William Gates"
## "Carlos Slim Helu & family" "Carlos Slim Helu & family"
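If looking at the dendrogram is impractical, the number of clusters at each candidate height can be tabulated instead. This fits the idea in the question of moving the threshold step by step. A sketch reusing the same data and calls; `heights` and `n_clusters` are names introduced here.

```r
raw_names <- c("William Gates", "Bill Gates", "Gates, William H. III",
               "Gates, William III", "William H Gates", "William H. Gates",
               "Carlos Slim Helu & family", "Carlos Slim Helu",
               "Carlos Slim & Family", "Carlos Slim")

lev_dist <- adist(raw_names, raw_names, partial = TRUE, ignore.case = TRUE)
hc <- hclust(as.dist(lev_dist), method = "single")

# Every height at which the tree merges something is a candidate cut.
heights <- sort(unique(hc$height))
n_clusters <- sapply(heights, function(h) length(unique(cutree(hc, h = h))))

data.frame(height = heights, clusters = n_clusters)
```

Reading the table from the bottom up corresponds to lowering the threshold: each step splits the names into more groups.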