I would like to find duplicates in a data frame based on first name and last name, using partial string matching.
Here is a sample data frame. I would like to know whether there is a function that can find the duplicates using only partial string matching.
df
name last
Joseph Smith
Jose Smith
Joseph Smit
Maria Cruz
maria cru
Mari Cruz
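To make the example reproducible, the sample data can be built roughly as follows. This is a minimal sketch; `stringsAsFactors = FALSE` is an assumption added here so that the names stay character vectors on older R versions.

```r
# Sample data from the question: six rows, two character columns
df <- data.frame(
  name = c("Joseph", "Jose", "Joseph", "Maria", "maria", "Mari"),
  last = c("Smith", "Smith", "Smit", "Cruz", "cru", "Cruz"),
  stringsAsFactors = FALSE  # keep strings as character, not factor
)

print(df)
```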
Data preparation
Using dplyr, first concatenate the first and last names into a single whole name:
library(dplyr)
df1 <- df %>%
  rowwise() %>%  # rowwise operation
  mutate(whole = paste0(name, last, collapse = "")) %>%  # concatenate first and last name by row
  ungroup()  # remove rowwise grouping
Output
name last whole
1 Joseph Smith JosephSmith
2 Jose Smith JoseSmith
3 Joseph Smit JosephSmit
4 Maria Cruz MariaCruz
5 maria cru mariacru
6 Mari Cruz MariCruz
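Because `paste0()` is already vectorized, the same `whole` column can probably be built without `rowwise()`. The sketch below uses base R only and rebuilds the sample data so it runs on its own. The lower-cased copy is an optional extra that is not part of the original answer, and `whole_lower` is a hypothetical column name.

```r
df <- data.frame(
  name = c("Joseph", "Jose", "Joseph", "Maria", "maria", "Mari"),
  last = c("Smith", "Smith", "Smit", "Cruz", "cru", "Cruz"),
  stringsAsFactors = FALSE
)

# paste0() works element-wise, so no row-by-row grouping is needed
df1 <- df
df1$whole <- paste0(df1$name, df1$last)

# Optional (assumption, not in the original answer): a lower-cased copy,
# so that differences in capitalization do not count as edits
df1$whole_lower <- tolower(df1$whole)

print(df1)
```

Whether to match on `whole` or on a case-normalized column depends on how much weight capitalization should carry in the data at hand.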
Grouping similar strings
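This recursive function uses agrepl, the logical version of approximate grep, to find groups of related strings and label each group with grp. Note that the tolerance for string differences is set by max.distance: the lower the number, the stricter the matching.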
desired <- NULL
grp <- 1
special <- function(x, y, grp) {
  if (nrow(y) < 1) {  # if y is empty, return the accumulated data
    return(x)
  } else {
    similar <- agrepl(y$whole[1], y$whole, max.distance = 0.4)  # find strings similar to the first one
    x <- rbind(x, y[similar, ] %>% mutate(grp = grp))  # save similar strings under the current group
    y <- y[!similar, ]  # remaining non-similar strings (setdiff() would also collapse exact duplicate rows)
    special(x, y, grp + 1)  # run the function again on the non-similar strings
  }
}
desired <- special(desired, df1, grp)
Output
name last whole grp
1 Joseph Smith JosephSmith 1
2 Jose Smith JoseSmith 1
3 Joseph Smit JosephSmit 1
4 Maria Cruz MariaCruz 2
5 maria cru mariacru 2
6 Mari Cruz MariCruz 2
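To get a feel for the max.distance setting, it can help to try a single pattern at two tolerances before running the grouping. This is a small base R sketch; the variable names `whole`, `strict`, and `loose` are only for illustration.

```r
whole <- c("JosephSmith", "JoseSmith", "JosephSmit",
           "MariaCruz", "mariacru", "MariCruz")

# max.distance = 0 allows no edits: only strings containing the pattern match
strict <- agrepl("JosephSmith", whole, max.distance = 0)

# max.distance = 0.4 allows edits up to about 40% of the pattern length
# (rounded up), so shortened spellings of the same name match as well
loose <- agrepl("JosephSmith", whole, max.distance = 0.4)

print(data.frame(whole, strict, loose))
```

Comparing the two columns side by side is a quick way to choose a tolerance that merges misspellings without merging different people.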
Removing whole
df2 <- desired %>% select(-whole)  # drop the helper column, keep grp