How to group similar strings together in a database in R

I have a tibble with a single column named 'title'.

> dat
# A tibble: 13 x 1
   title                                          
   <chr>                                          
 1 lymphoedema clinic                             
 2 zostavax shingles vaccine                      
 3 xray operator                                  
 4 workplace mental health wellbeing workshop     
 5 zostavax recall toolkit                        
 6 xray meetint                                   
 7 workplace mental health and wellbeing          
 8 lymphoedema early intervenstion                
 9 lymphoedema expo                               
10 lymphoedema for breast care nurses             
11 xray meeting and case studies                  
12 xray online examination                        
13 xray operator in service paediatric extremities

I would like to find similar records and group them together (while keeping their indices):

> dat
# A tibble: 13 x 1
   title                                          
   <chr>                                          
 1 lymphoedema clinic   
 8 lymphoedema early intervenstion                
 9 lymphoedema expo                               
10 lymphoedema for breast care nurses                             
 2 zostavax shingles vaccine 
 5 zostavax recall toolkit                                
 3 xray operator                                  
 6 xray meetint     
11 xray meeting and case studies                  
12 xray online examination                        
13 xray operator in service paediatric extremities
 4 workplace mental health wellbeing workshop                                  
 7 workplace mental health and wellbeing          

I am using the following function to find strings that are close enough to each other (cutoff = 0.75):

compareJW <- function(string1, string2, cutoff)
{
  require(RecordLinkage)
  jarowinkler(string1, string2) > cutoff
}
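For readers without RecordLinkage installed, the same kind of cutoff check can be sketched with the stringdist package. This is a minimal sketch, not the asker's function: `compareJW_sd` is a hypothetical name, and `p = 0.1` (the usual Winkler prefix weight) is an assumed setting, so the scores may not match `jarowinkler()` exactly. Note that stringdist returns a distance, so the similarity is taken as 1 minus the distance.

```r
library(stringdist)

# Hypothetical stand-in for compareJW(), built on stringdist instead of RecordLinkage.
# stringdist() returns a Jaro-Winkler *distance* in [0, 1]; 1 - distance is the similarity.
compareJW_sd <- function(string1, string2, cutoff) {
  sim <- 1 - stringdist(string1, string2, method = "jw", p = 0.1)
  sim > cutoff
}

# Three titles taken from the data in the question.
titles <- c("xray meetint", "xray meeting and case studies", "lymphoedema clinic")

# Compare the first title against all three (vectorised over the second argument).
flags <- compareJW_sd(titles[1], titles, 0.75)
flags
```

Because the call is vectorised, one title can be compared against the whole column at once, which removes the need for the inner loop.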

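I have implemented the loop below to 'send' similar records into a new data frame, but it does not work properly. I have tried a few variations, but nothing has worked so far.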

# create new database
newDB <- data.frame(matrix(ncol = ncol(dat), nrow = 0))
colnames(newDB) <- names(dat)
newDB <- as_tibble(newDB)

for(i in 1:nrow(dat))
{
  # print(dat$title[i])

  for(j in 1:nrow(dat))
  {
    print(dat$title[i])
    print(dat$title[j])
    # score <- jarowinkler(dat$title[i], dat$title[j])

    if(dat$title[i] != dat$title[j]
       &&
       compareJW(dat$title[i], dat$title[j], 0.75))
    {
      print("if")

      # newDB <- rbind(newDB, 
      #                dat$title[i],
      #                dat$title[j])
    }
    else
    {
      print("else")
      # newDB <- rbind(newDB, dat$title[i])
    }
  }
}

(I added the print calls inside the loop to see what is happening.)

Reproducible data:

dat <- 
structure(list(title = c("lymphoedema clinic", "zostavax shingles vaccine", 
                         "xray operator", "workplace mental health wellbeing workshop", 
                         "zostavax recall toolkit", "xray meetint", "workplace mental health and wellbeing", 
                         "lymphoedema early intervenstion", "lymphoedema expo", "lymphoedema for breast care nurses", 
                         "xray meeting and case studies", "xray online examination", "xray operator in service paediatric extremities"
)), row.names = c(NA, -13L), class = c("tbl_df", "tbl", "data.frame"
))

Any suggestions? EDIT: I would also like a new index column named 'group', as follows:

> dat
# A tibble: 13 x 1
index   group    title                                          
                 <chr>                                          
 1       1   lymphoedema clinic   
 8       1   lymphoedema early intervenstion                
 9       1   lymphoedema expo                               
10       1   lymphoedema for breast care nurses                             
 2       2   zostavax shingles vaccine 
 5       2   zostavax recall toolkit                                
 3       3   xray operator                                  
 6       3   xray meetint     
11       3   xray meeting and case studies                  
12       3   xray online examination                        
13       3   xray operator in service paediatric extremities
 4       4   workplace mental health wellbeing workshop                                  
 7       4   workplace mental health and wellbeing          

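I'm afraid I have never tried RecordLinkage, but if you are just using the Jaro-Winkler distance, it should also be fairly easy to cluster similar strings with the stringdist package. Using your dput above: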

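library(tidyverse)
library(stringdist)

map_dfr(dat$title, ~ {
    # rows whose Jaro-Winkler distance to the current title is below 0.40
    i <- which(stringdist(.x, dat$title, "jw") < 0.40)
    tibble(index = i, title = dat$title[i])
}, .id = "group") %>%
    # keep each row only in the first group that claimed it
    distinct(index, .keep_all = TRUE) %>% 
    mutate(group = as.integer(group))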

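Explanation: map_dfr iterates over each string in dat$title and extracts the indices of the closest matches as computed by stringdist (constrained by 0.40, i.e. your "threshold"). It creates a small tibble with those indices and the matching titles, then stacks these tibbles with a group variable that corresponds to the integer position (and row number) of the original string. distinct then removes any duplicates across clusters, based on repeated values of index.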

Output:

# A tibble: 13 x 3
   group index title                                          
   <int> <int> <chr>                                          
 1     1     1 lymphoedema clinic                             
 2     1     8 lymphoedema early intervenstion                
 3     1     9 lymphoedema expo                               
 4     1    10 lymphoedema for breast care nurses             
 5     2     2 zostavax shingles vaccine                      
 6     2     5 zostavax recall toolkit                        
 7     2    11 xray meeting and case studies                  
 8     3     3 xray operator                                  
 9     3     6 xray meetint                                   
10     3    12 xray online examination                        
11     3    13 xray operator in service paediatric extremities
12     4     4 workplace mental health wellbeing workshop     
13     4     7 workplace mental health and wellbeing          
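In the output above, row 11 ("xray meeting and case studies") ends up in group 2, because the first title within the 0.40 threshold claims a row and distinct keeps only that first assignment. One possible way to avoid this order dependence is hierarchical clustering on the full distance matrix. The following is a minimal sketch, not part of the original answer: the choice of average linkage, `p = 0.1`, and `k = 4` clusters are all assumptions that would need tuning on real data, and the resulting groups are not guaranteed to match the desired output.

```r
library(stringdist)

titles <- c("lymphoedema clinic", "zostavax shingles vaccine",
            "xray operator", "workplace mental health wellbeing workshop",
            "zostavax recall toolkit", "xray meetint",
            "workplace mental health and wellbeing",
            "lymphoedema early intervenstion", "lymphoedema expo",
            "lymphoedema for breast care nurses",
            "xray meeting and case studies", "xray online examination",
            "xray operator in service paediatric extremities")

# All pairwise Jaro-Winkler distances, returned as a "dist" object.
d <- stringdistmatrix(titles, method = "jw", p = 0.1)

# Agglomerative clustering; average linkage is an assumed choice.
hc <- hclust(d, method = "average")

# Cut the tree into 4 groups (assumed); cutree(hc, h = ...) cuts by distance instead.
group <- cutree(hc, k = 4)

res <- data.frame(index = seq_along(titles), group = group, title = titles)
res <- res[order(res$group, res$index), ]
res
```

Cutting by height (`h`) rather than by a fixed `k` is closer in spirit to the distance threshold used above, since the number of groups is then determined by the data.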

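An interesting alternative would be to use tidytext with widyr to tokenize by word and compute the cosine similarity of the titles based on similar words, rather than on characters as above.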
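The word-level idea could be sketched roughly as follows. This is only an illustration under assumptions: it uses `unnest_tokens()` from tidytext and `pairwise_similarity()` from widyr, weights words by raw counts (no stop-word removal or tf-idf), and stops at the pairwise similarities without assigning groups.

```r
library(dplyr)
library(tidytext)
library(widyr)

dat <- tibble(title = c(
  "lymphoedema clinic", "zostavax shingles vaccine",
  "xray operator", "workplace mental health wellbeing workshop",
  "zostavax recall toolkit", "xray meetint",
  "workplace mental health and wellbeing",
  "lymphoedema early intervenstion", "lymphoedema expo",
  "lymphoedema for breast care nurses",
  "xray meeting and case studies", "xray online examination",
  "xray operator in service paediatric extremities"))

# One row per (title index, word), with the word count as the weight.
words <- dat %>%
  mutate(index = row_number()) %>%
  unnest_tokens(word, title) %>%
  count(index, word)

# Cosine similarity between titles based on shared words;
# upper = FALSE keeps each pair only once.
sims <- words %>%
  pairwise_similarity(index, word, n, upper = FALSE) %>%
  arrange(desc(similarity))

sims
```

Because matching is done on whole words, a misspelling such as "meetint" shares no token with "meeting", so a word-based measure may need to be combined with a character-based one for data like this.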