I would like to find duplicates in a data frame based on first name and last name, using partial string matching.

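Here is a sample data frame. I would like to know whether there is any function that can find the duplicates using only partial string matching.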

df

name   last
Joseph Smith
Jose   Smith
Joseph Smit
Maria  Cruz
maria  cru
Mari   Cruz
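
For anyone who wants to reproduce this, the sample above could be built roughly as follows. This is a minimal sketch; the `stringsAsFactors = FALSE` argument is my own assumption, added so that both columns stay character vectors.

```r
# Sample data copied from the table above
df <- data.frame(
  name = c("Joseph", "Jose", "Joseph", "Maria", "maria", "Mari"),
  last = c("Smith", "Smith", "Smit", "Cruz", "cru", "Cruz"),
  stringsAsFactors = FALSE  # assumption: keep names as character, not factor
)

print(df)
```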

Data preparation

Using dplyr, first concatenate the first and last names into a single whole name

library(dplyr)
df1 <- df %>%
          rowwise() %>%              # rowwise operation
          mutate(whole=paste0(name,last,collapse="")) %>%  # concatenate first and last name by row
          ungroup()                  # remove rowwise grouping

Output

    name   last       whole
1 Joseph  Smith JosephSmith
2   Jose  Smith   JoseSmith
3 Joseph   Smit  JosephSmit
4  Maria   Cruz   MariaCruz
5  maria    cru    mariacru
6   Mari   Cruz    MariCruz

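Grouping similar strings

This recursive function uses agrepl (approximate grep that returns a logical vector) to find groups of related strings, and labels each group in a grp column. Note: the tolerance for differences between strings is set by max.distance. The smaller the number, the stricter the match.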

desired <- NULL
grp <- 1    
special <- function(x, y, grp) {
                if (nrow(y) < 1) {        # if y is empty return data
                     return(x)
                } else {
                     similar <- agrepl(y$whole[1], y$whole, max.distance=0.4)      # find similar occurring strings
                     x <- rbind(x, y[similar,] %>% mutate(grp=grp))    # save similar strings
                     y <- setdiff(y, y[similar,])        # remaining non-similar strings
                     special(x, y, grp+1)       # run function again on non-similar strings
                }
            }

desired <- special(desired, df1, grp)

Output

    name   last       whole   grp
1 Joseph  Smith JosephSmith     1
2   Jose  Smith   JoseSmith     1
3 Joseph   Smit  JosephSmit     1
4  Maria   Cruz   MariaCruz     2
5  maria    cru    mariacru     2
6   Mari   Cruz    MariCruz     2
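
The grouping depends heavily on max.distance, so it may help to see how agrepl behaves at two settings. The sketch below uses only base R and strings from the sample; the variable names `pattern`, `candidates`, `strict` and `loose` are mine. A value below 1 is read as a fraction of the pattern length, so 0.4 on an 11-character pattern allows up to 5 edits, while 0 allows none.

```r
pattern <- "JosephSmith"
candidates <- c("JosephSmith", "JoseSmith", "JosephSmit", "MariaCruz")

# max.distance = 0: no edits allowed, the pattern must appear exactly
strict <- agrepl(pattern, candidates, max.distance = 0)

# max.distance = 0.4: edits allowed up to 40% of the pattern length
loose <- agrepl(pattern, candidates, max.distance = 0.4)

# Side-by-side view of which candidates pass at each tolerance
print(data.frame(candidates, strict, loose))
```

Note that agrepl is case-sensitive unless ignore.case = TRUE is passed, so "maria" versus "Maria" counts as one edit in the grouping above.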

Drop the whole column

df2 <- desired %>% select(-whole)   # keep name, last and grp
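
Once every row carries a grp label, rows that share a group can be treated as likely duplicates. The following is only a sketch of one way to act on that, using base R so it runs without extra packages. The data frame `grouped` is a hand-copied version of the grouped result, and the column names `n_in_grp` and `is_dup` are hypothetical.

```r
# Hand-copied version of the grouped result shown above
grouped <- data.frame(
  name = c("Joseph", "Jose", "Joseph", "Maria", "maria", "Mari"),
  last = c("Smith", "Smith", "Smit", "Cruz", "cru", "Cruz"),
  grp  = c(1, 1, 1, 2, 2, 2),
  stringsAsFactors = FALSE
)

# Size of the group each row belongs to
grouped$n_in_grp <- ave(grouped$grp, grouped$grp, FUN = length)

# Hypothetical flag: TRUE when at least one other row shares the group
grouped$is_dup <- grouped$n_in_grp > 1

# Keep the first row of each group as its representative
deduped <- grouped[!duplicated(grouped$grp), ]

print(grouped)
print(deduped)
```

Which row to keep as the representative is a judgment call; taking the first one is just the simplest choice.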