Remove duplicates from a list based on semantic similarity/relatedness
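R + tm: How can I de-duplicate the items in a list based on semantic similarity?
v <- c("bank", "banks", "banking", "ford_suv", "toyota_suv", "nissan_suv")
My expected result is c("bank", "ford_suv", "toyota_suv", "nissan_suv"), that is, bank, banks and banking reduced to the single term "bank". SnowBall::stemming is not an option, because I need to preserve the flavor of the newspaper styles of different countries. Any help or guidance would be useful.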
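We can use adist to compute the Levenshtein distance between the words, and then use hclust to regroup them into clusters. Here v holds the 12 terms that appear as row names in the matrix below:
v <- c("bank", "banks", "banking", "ford_suv", "toyota_suv", "nissan_suv",
       "banker", "toyota_corolla", "toyota", "nissan", "bankers", "ford")
d <- adist(v)
rownames(d) <- v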
This gives the distance matrix between the terms:
#               [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#bank              0    1    3    8    9    8    2   13    6     5     3     4
#banks             1    0    3    7    9    7    2   13    6     6     2     5
#banking           3    3    0    8   10    8    3   13    7     6     3     7
#ford_suv          8    7    8    0    5    6    8   12    7     7     8     4
#toyota_suv        9    9   10    5    0    6    9    7    4     9     9     9
#nissan_suv        8    7    8    6    6    0    8   13   10     4     8    10
#banker            2    2    3    8    9    8    0   12    6     6     1     6
#toyota_corolla   13   13   13   12    7   13   12    0    8    13    12    12
#toyota            6    6    7    7    4   10    6    8    0     6     7     5
#nissan            5    6    6    7    9    4    6   13    6     0     7     6
#bankers           3    2    3    8    9    8    1   12    7     7     0     6
#ford              4    5    7    4    9   10    6   12    5     6     6     0
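A few entries of this matrix are easy to check by hand. The sketch below rebuilds it with the 12 terms that appear as its row names and looks up three pairs; labelling the columns as well as the rows is an addition made here for readable lookups.

```r
# The 12 terms that appear as row names of the distance matrix.
v <- c("bank", "banks", "banking", "ford_suv", "toyota_suv", "nissan_suv",
       "banker", "toyota_corolla", "toyota", "nissan", "bankers", "ford")

d <- adist(v)               # pairwise Levenshtein distances (utils::adist)
dimnames(d) <- list(v, v)   # label rows and columns (addition for lookups)

d["bank", "banks"]     # 1: one insertion ("s")
d["bank", "banking"]   # 3: three insertions ("i", "n", "g")
d["bank", "ford"]      # 4: four substitutions
isSymmetric(d)         # TRUE with the default unit costs
```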
We can then pass it to hclust with method = "ward.D":
cl <- hclust(as.dist(d), method = "ward.D")
plot(cl)
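This plots the cluster dendrogram.
We can see 4 distinct clusters in it (rect.hclust(cl, 4) can be used to outline them on the plot).
Now we can convert this result to a data.frame and tag each cluster with its shortest term: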
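The cluster membership behind the dendrogram can also be inspected without plotting. A minimal sketch, assuming the same 12 terms as in the distance matrix and k = 4 as read off the plot; the name `groups` is illustrative:

```r
v <- c("bank", "banks", "banking", "ford_suv", "toyota_suv", "nissan_suv",
       "banker", "toyota_corolla", "toyota", "nissan", "bankers", "ford")
d <- adist(v)
rownames(d) <- v
cl <- hclust(as.dist(d), method = "ward.D")

groups <- cutree(cl, k = 4)    # named integer vector: term -> cluster id
split(names(groups), groups)   # the terms that fall in each cluster
table(groups)                  # cluster sizes
```

When the number of clusters is not known in advance, cutree() also accepts a height through its h argument instead of k; a suitable height depends on the data.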
library(dplyr)
data.frame(group = cutree(cl, 4)) %>%
tibble::rownames_to_column("term") %>%
group_by(group) %>%
mutate(tag = term[nchar(term) == min(nchar(term))])
which gives:
#Source: local data frame [12 x 3]
#Groups: group [4]
#
#             term group    tag
#            <chr> <int>  <chr>
#1            bank     1   bank
#2           banks     1   bank
#3         banking     1   bank
#4        ford_suv     2   ford
#5      toyota_suv     3 toyota
#6      nissan_suv     4 nissan
#7          banker     1   bank
#8  toyota_corolla     3 toyota
#9          toyota     3 toyota
#10         nissan     4 nissan
#11        bankers     1   bank
#12           ford     2   ford
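One caveat: if two terms in a cluster tie for the shortest length, the subset term[nchar(term) == min(nchar(term))] yields more than one value, which mutate() may reject. A hedged base R sketch that always keeps the first shortest term by using which.min(); the names `groups`, `shortest` and `tag` are illustrative:

```r
v <- c("bank", "banks", "banking", "ford_suv", "toyota_suv", "nissan_suv",
       "banker", "toyota_corolla", "toyota", "nissan", "bankers", "ford")
d <- adist(v)
rownames(d) <- v
cl <- hclust(as.dist(d), method = "ward.D")
groups <- cutree(cl, k = 4)

# For each cluster, keep the first term with the minimum character count.
shortest <- vapply(split(v, groups),
                   function(x) x[which.min(nchar(x))],
                   character(1))

# Map every term to the tag of its cluster.
tag <- unname(shortest[as.character(groups)])
data.frame(term = v, group = unname(groups), tag = tag)
```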
If we only want to extract the unique tag for each cluster, we can add ... %>% distinct(tag) %>% .$tag to the pipe, which gives:
#[1] "bank" "ford" "toyota" "nissan"
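Put together, the steps could be wrapped in a single helper and applied to the six-term vector from the question. This is only a sketch: `dedupe_terms()` is a hypothetical name, and the caller still has to choose the number of clusters `k`.

```r
# Hypothetical helper: cluster terms by edit distance and keep one
# representative (the first shortest term) per cluster.
dedupe_terms <- function(terms, k) {
  d <- adist(terms)
  rownames(d) <- terms
  cl <- hclust(as.dist(d), method = "ward.D")
  groups <- cutree(cl, k = k)
  reps <- vapply(split(terms, groups),
                 function(x) x[which.min(nchar(x))],
                 character(1))
  unname(reps)
}

# The vector from the question, with k = 4 as in the expected result.
v6 <- c("bank", "banks", "banking", "ford_suv", "toyota_suv", "nissan_suv")
res <- dedupe_terms(v6, k = 4)
res   # "bank" "ford_suv" "toyota_suv" "nissan_suv"
```

Here bank, banks and banking merge first because their mutual distances (1 and 3) are smaller than any distance among the *_suv terms (5 or more), so cutting at 4 clusters leaves the three SUV terms on their own.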
References
?adist
The (generalized) Levenshtein (or edit) distance between two strings s
and t is the minimal possibly weighted number of insertions, deletions
and substitutions needed to transform s into t (so that the
transformation exactly matches t).
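The "possibly weighted" part refers to the costs argument of adist(). A small sketch with an assumed weighting (insertions cost 2, deletions and substitutions cost 1), chosen only to show that the distance then depends on the direction of the transformation:

```r
# Default: unit cost for insertions, deletions and substitutions.
adist("bank", "banks")   # 1
adist("banks", "bank")   # 1

# Assumed weights, for illustration only.
w <- c(insertions = 2, deletions = 1, substitutions = 1)
fwd <- adist("bank", "banks", costs = w)   # "bank" -> "banks"
bwd <- adist("banks", "bank", costs = w)   # "banks" -> "bank"
c(forward = fwd, backward = bwd)           # the two directions differ
```

A matrix built with unequal insertion and deletion costs is no longer symmetric, so it would need to be symmetrized before being passed to as.dist() and hclust().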
?hclust
This function performs a hierarchical cluster analysis using a set of
dissimilarities for the n objects being clustered. Initially, each
object is assigned to its own cluster and then the algorithm proceeds
iteratively, at each stage joining the two most similar clusters,
continuing until there is just a single cluster.
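The iterative joining described here is recorded in the object returned by hclust(). A brief sketch, again assuming the 12 terms from the distance matrix, that looks at the merge history:

```r
v <- c("bank", "banks", "banking", "ford_suv", "toyota_suv", "nissan_suv",
       "banker", "toyota_corolla", "toyota", "nissan", "bankers", "ford")
d <- adist(v)
rownames(d) <- v
cl <- hclust(as.dist(d), method = "ward.D")

nrow(cl$merge)    # one row per merge step: n - 1 rows for n terms
cl$merge[1, ]     # negative entries denote single observations (indices into v)
head(cl$height)   # dissimilarity at which each merge happened
cl$labels         # the terms, in their original order
```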
Note: I used the data provided by @Abdou in the comments, since it represents a more complete use case.