Remove duplicates from a list based on semantic similarity/relatedness

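R + tm: How can I remove duplicate items from a list based on semantic similarity? v <- c("bank", "banks", "banking", "ford_suv", "toyota_suv", "nissan_suv"). My expected result is c("bank", "ford_suv", "toyota_suv", "nissan_suv"). That is, bank, banks and banking should be reduced to a single term, "bank". SnowBall::stemming is not an option, because I have to retain the flavor of the newspaper style of various countries. Any help or direction would be useful.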

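We can compute the Levenshtein distance between the words with adist and regroup them into clusters with hclust. Here v is the extended 12-term vector suggested in the comments (see the note at the end), not the 6-term vector from the question:

d <- adist(v)
rownames(d) <- v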

This gives the distance matrix between the terms:

#              [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#bank              0    1    3    8    9    8    2   13    6     5     3     4
#banks             1    0    3    7    9    7    2   13    6     6     2     5
#banking           3    3    0    8   10    8    3   13    7     6     3     7
#ford_suv          8    7    8    0    5    6    8   12    7     7     8     4
#toyota_suv        9    9   10    5    0    6    9    7    4     9     9     9
#nissan_suv        8    7    8    6    6    0    8   13   10     4     8    10
#banker            2    2    3    8    9    8    0   12    6     6     1     6
#toyota_corolla   13   13   13   12    7   13   12    0    8    13    12    12
#toyota            6    6    7    7    4   10    6    8    0     6     7     5
#nissan            5    6    6    7    9    4    6   13    6     0     7     6
#bankers           3    2    3    8    9    8    1   12    7     7     0     6
#ford              4    5    7    4    9   10    6   12    5     6     6     0
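To reproduce the matrix and look up individual distances by name, a minimal sketch follows. It assumes the 12 terms are the ones shown in the row labels above, in that order; the names m and nearest are only illustrative.

```r
# Terms taken from the row labels of the distance matrix above
v <- c("bank", "banks", "banking", "ford_suv", "toyota_suv", "nissan_suv",
       "banker", "toyota_corolla", "toyota", "nissan", "bankers", "ford")

d <- adist(v)               # pairwise Levenshtein distances (utils package)
dimnames(d) <- list(v, v)   # label rows and columns so cells can be read by name

d["bank", "banks"]          # 1: a single insertion ("s")
d["bank", "ford"]           # 4: all four characters are substituted

# Closest other term for each word; the diagonal is masked so that
# a term cannot be matched with itself
m <- d
diag(m) <- NA
nearest <- v[apply(m, 1, which.min)]
names(nearest) <- v
nearest[c("banks", "bankers")]
```

Labelling the columns as well as the rows is optional; hclust only needs the row names, which become the leaf labels of the tree.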

We can then pass it to hclust with method = "ward.D":

cl <- hclust(as.dist(d), method = "ward.D")
plot(cl)

This plots a dendrogram in which we can see 4 distinct clusters (we could use rect.hclust(cl, 4) to outline them on the plot).
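The cluster membership can also be checked without reading it off the plot. The sketch below assumes the same 12 terms (reconstructed from the row labels of the distance matrix) and k = 4 as chosen from the dendrogram; groups is just an illustrative name.

```r
v <- c("bank", "banks", "banking", "ford_suv", "toyota_suv", "nissan_suv",
       "banker", "toyota_corolla", "toyota", "nissan", "bankers", "ford")

d <- adist(v)
rownames(d) <- v
cl <- hclust(as.dist(d), method = "ward.D")

# Cut the tree into 4 groups; the result is a named integer vector
# mapping each term to its cluster id
groups <- cutree(cl, k = 4)

# List the terms that fall into each cluster
split(names(groups), groups)
```

The value of k is a judgment call made from the dendrogram; a different data set will generally need a different k.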

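Now we can convert this result to a data.frame and tag each cluster with its shortest term:

library(dplyr)

# cutree() returns a named vector; the names become row names, then a "term" column
data.frame(group = cutree(cl, 4)) %>%
  tibble::rownames_to_column("term") %>%
  group_by(group) %>%
  # which.min() returns a single index, so a tie in length cannot break mutate()
  mutate(tag = term[which.min(nchar(term))])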

Which gives:

#Source: local data frame [12 x 3]
#Groups: group [4]
#
#             term group      tag
#            <chr> <int>    <chr>
#1            bank     1     bank
#2           banks     1     bank
#3         banking     1     bank
#4        ford_suv     2     ford
#5      toyota_suv     3   toyota
#6      nissan_suv     4   nissan
#7          banker     1     bank
#8  toyota_corolla     3   toyota
#9          toyota     3   toyota
#10         nissan     4   nissan
#11        bankers     1     bank
#12           ford     2     ford

If we only want to extract the unique tag of each cluster, we can add ... %>% distinct(tag) %>% .$tag to the pipe, which gives:

#[1] "bank"   "ford"   "toyota" "nissan"
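The whole procedure can also be wrapped in a small base R function. This is only a sketch under stated assumptions: dedupe_terms is a hypothetical helper, not part of any package, the number of clusters k still has to be supplied by hand, and ties for the shortest term are resolved by taking the first one.

```r
# Hypothetical helper: cluster terms by edit distance and keep the
# shortest term of each cluster as its representative
dedupe_terms <- function(terms, k) {
  d <- adist(terms)
  rownames(d) <- terms
  cl <- hclust(as.dist(d), method = "ward.D")
  groups <- cutree(cl, k = k)
  # which.min() returns the first shortest term in each group
  reps <- vapply(split(terms, groups),
                 function(x) x[which.min(nchar(x))],
                 character(1))
  unname(reps)
}

# The 6-term vector from the question, with k = 4 as in the expected result
v_question <- c("bank", "banks", "banking",
                "ford_suv", "toyota_suv", "nissan_suv")
dedupe_terms(v_question, k = 4)
# [1] "bank"       "ford_suv"   "toyota_suv" "nissan_suv"
```

Note that this groups terms by spelling, not by meaning, so it only approximates semantic similarity for terms that share most of their characters.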

References

?adist

The (generalized) Levenshtein (or edit) distance between two strings s and t is the minimal possibly weighted number of insertions, deletions and substitutions needed to transform s into t (so that the transformation exactly matches t).

?hclust

This function performs a hierarchical cluster analysis using a set of dissimilarities for the n objects being clustered. Initially, each object is assigned to its own cluster and then the algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster.


Note: I used the data provided by @Abdou in the comments, since it represents a more complete use case.