Select similar sentences

Question

If I have a set of sentences and want to extract the duplicates, I can work as in the following example:

sentences<-c("So there I was at the mercy of three monstrous trolls",
         "Today is my One Hundred and Eleventh birthday",
         "I'm sorry I brought this upon you, my",
         "So there I was at the mercy of three monstrous trolls",
         "Today is my One Hundred and Eleventh birthday",
         "I'm sorry I brought this upon you, my")

sentences[duplicated(sentences)]

which returns:

[1] "So there I was at the mercy of three monstrous trolls"
[2] "Today is my One Hundred and Eleventh birthday"        
[3] "I'm sorry I brought this upon you, my"

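But in my case I have sentences that are similar to each other (for example, because of typos), and I want to select the ones that are most similar to one another. For example: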

sentences<-c("So there I was at the mercy of three monstrous trolls",
             "Today is my One Hundred and Eleventh birthday",
             "I'm sorry I brrrought this upon, my",
             "So there I was at mercy of three monstrous troll",
             "Today is One Hundred Eleventh birthday",
             "I'm sorry I brought this upon you, my")

Based on this example, I would like to select one sentence from each of the following pairs:

I'm sorry I brought this upon you, my
I'm sorry I brrrought this upon, my

Today is One Hundred Eleventh birthday
Today is my One Hundred and Eleventh birthday

So there I was at the mercy of three monstrous trolls
So there I was at mercy of three monstrous troll

The levenshteinSim function from the RecordLinkage package can help me:

library(RecordLinkage)


levenshteinSim(sentences[1],sentences[2])
levenshteinSim(sentences[1],sentences[3])
levenshteinSim(sentences[1],sentences[4])
levenshteinSim(sentences[1],sentences[5])
levenshteinSim(sentences[1],sentences[6])

levenshteinSim(sentences[2],sentences[3])
levenshteinSim(sentences[2],sentences[4])
levenshteinSim(sentences[2],sentences[5])
levenshteinSim(sentences[2],sentences[6])

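And so on; a return value close to 1 indicates the most similar sentences. I could write a double for loop and select, for example, the sentence pairs whose Levenshtein similarity is greater than 0.7. But isn't there a simpler way?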

Answer 1

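TL;DR: you could probably use a bag-of-words (BoW) representation to turn these sentences into vectors. Then simply check the correlations, and drop a sentence if it is too highly correlated with another one.

Bag of words
Consider the following sentence:

Jack is a handsome, handsome man

and suppose that our entire universe of words is contained in that sentence. We can then build a vector of word counts for this sentence (one entry per distinct word), which gives a vector with 5 features (Jack, is, a, handsome, man).

The corresponding BoW representation is then: [1, 1, 1, 2, 1].
Another sentence in this universe might be:

Jack Jack handsome handsome man

Again, we can use our 5-feature vector to represent this sentence:

[2, 0, 0, 2, 1].

Then you can compute the similarity of these two sentences in R: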

# Jack is a handsome, handsome man
first <- c(1,1,1,2,1)

# Jack Jack handsome handsome man
second <- c(2,0,0,2,1)

cor(first, second, method = "pearson")
#> [1] 0.559017
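The same idea can be applied to the sentences from the question. What follows is only a minimal sketch in base R: the tokenizer (lower-casing, stripping punctuation, splitting on whitespace) and the names `tokens`, `vocab`, `bow`, and `best_match` are assumptions made for illustration, not part of the answer above.

```r
# Example sentences from the question.
sentences <- c("So there I was at the mercy of three monstrous trolls",
               "Today is my One Hundred and Eleventh birthday",
               "I'm sorry I brrrought this upon, my",
               "So there I was at mercy of three monstrous troll",
               "Today is One Hundred Eleventh birthday",
               "I'm sorry I brought this upon you, my")

# Crude tokenizer (an assumption): lower-case, drop everything except
# letters, apostrophes and spaces, then split on whitespace.
tokens <- strsplit(gsub("[^a-z' ]", "", tolower(sentences)), "\\s+")

# The vocabulary is every distinct word seen in any sentence.
vocab <- sort(unique(unlist(tokens)))

# vapply builds one column of word counts per sentence;
# transposing gives one row per sentence.
bow <- t(vapply(tokens,
                function(tk) as.integer(table(factor(tk, levels = vocab))),
                integer(length(vocab))))

# Pearson correlation between every pair of sentence vectors.
sim <- cor(t(bow))
diag(sim) <- NA  # ignore each sentence's correlation with itself

# Index of the most correlated other sentence, for each sentence.
best_match <- apply(sim, 1, which.max)
```

On the example sentences this pairs sentence 1 with 4, 2 with 5, and 3 with 6. Note that a typo such as "brrrought" becomes a separate vocabulary entry, so BoW tolerates misspellings only as long as most of the other words still match.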

Answer 2

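You can compute an approximate string distance matrix with adist, which is based on the generalized Levenshtein distance, and then perform hierarchical clustering with hclust.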

ld  <- adist(tolower(sentences))
hc <- hclust(as.dist(ld))
data.frame(x=sentences, cl=cutree(hc, h=10))
#                                                       x cl
# 1 So there I was at the mercy of three monstrous trolls  1
# 2         Today is my One Hundred and Eleventh birthday  2
# 3                   I'm sorry I brrrought this upon, my  3
# 4      So there I was at mercy of three monstrous troll  1
# 5                Today is One Hundred Eleventh birthday  2
# 6                 I'm sorry I brought this upon you, my  3
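Given the cluster labels, picking one sentence from each group of near-duplicates (which is what the question asks for) takes one more step. This is a sketch under the same settings as above; the names `cl`, `groups`, and `representatives` are illustrative, and keeping the first member of each cluster is an arbitrary choice.

```r
# Example sentences from the question.
sentences <- c("So there I was at the mercy of three monstrous trolls",
               "Today is my One Hundred and Eleventh birthday",
               "I'm sorry I brrrought this upon, my",
               "So there I was at mercy of three monstrous troll",
               "Today is One Hundred Eleventh birthday",
               "I'm sorry I brought this upon you, my")

# Same clustering as above: edit distances, then a cut at height 10.
ld <- adist(tolower(sentences))
cl <- cutree(hclust(as.dist(ld)), h = 10)

# All sentences grouped by cluster label.
groups <- split(sentences, cl)

# Keep only the first sentence encountered in each cluster.
representatives <- sentences[!duplicated(cl)]
```

Here `groups` is a list with one element per cluster, and `representatives` holds the first three sentences. A different rule, such as keeping the longest member of each cluster, could be substituted if the first occurrence is not the preferred spelling.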

To find a suitable value for h in cutree (h=10 here), we can plot the dendrogram.

plot(hc)
abline(h=10, col=2, lty=2)
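If a similarity threshold like the 0.7 from the question is easier to reason about than a tree height, the distance matrix can be rescaled to the range 0 to 1 without any explicit loop. The sketch below assumes similarity is defined as one minus the edit distance divided by the length of the longer string, which is my understanding of how levenshteinSim normalizes; treat that equivalence as an assumption. The names `len`, `sim`, `idx`, and `pairs` are illustrative.

```r
# Example sentences from the question.
sentences <- c("So there I was at the mercy of three monstrous trolls",
               "Today is my One Hundred and Eleventh birthday",
               "I'm sorry I brrrought this upon, my",
               "So there I was at mercy of three monstrous troll",
               "Today is One Hundred Eleventh birthday",
               "I'm sorry I brought this upon you, my")

# Pairwise edit distances, case-insensitive.
d <- adist(tolower(sentences))

# Normalize by the longer string of each pair (assumed definition).
len <- nchar(sentences)
sim <- 1 - d / outer(len, len, pmax)

# Keep each pair once (upper triangle) when it clears the threshold.
idx <- which(sim > 0.7 & upper.tri(sim), arr.ind = TRUE)

pairs <- data.frame(first      = sentences[idx[, "row"]],
                    second     = sentences[idx[, "col"]],
                    similarity = sim[idx])
```

On the example data `pairs` has three rows, matching sentence 1 with 4, 2 with 5, and 3 with 6. The 0.7 cutoff comes from the question and would need tuning on real data.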

Answer 3

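You can generate an embedding for each sentence and then compute the cosine similarity between them.

The embeddings can be generated with a BERT-based model or a GloVe model.

BERT: use Sentence Transformers, which are built specifically for semantic similarity and paraphrase mining.

GloVe: tokenize the sentences, remove stop words, lemmatize to get the base words, generate the word embeddings and combine them into a single sentence embedding, then compute the similarity score, again with cosine similarity.

Keeping the pairs whose similarity score exceeds roughly 93-95% gives you the list of the most similar sentences.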
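Whichever model produces the embeddings, the comparison step is the same. The sketch below shows only that step in base R; the two vectors are hypothetical stand-ins (the word-count vectors from Answer 1), not real BERT or GloVe embeddings, and the helper names `cosine_similarity` and `cosine_matrix` are made up for illustration.

```r
# Cosine similarity between two numeric vectors.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Cosine similarity between all rows of a matrix (one embedding per row).
cosine_matrix <- function(m) {
  unit <- m / sqrt(rowSums(m^2))  # scale every row to unit length
  tcrossprod(unit)                # dot products of unit vectors
}

# Hypothetical placeholder "embeddings"; real ones would come from a model.
emb <- rbind(first  = c(1, 1, 1, 2, 1),
             second = c(2, 0, 0, 2, 1))

pair_sim <- cosine_similarity(emb["first", ], emb["second", ])
all_sim  <- cosine_matrix(emb)

# Pairs above a chosen cutoff, e.g. 0.93, counted once each.
similar <- which(all_sim > 0.93 & upper.tri(all_sim), arr.ind = TRUE)
```

For these placeholder vectors the similarity is about 0.825, so nothing clears a 0.93 cutoff. That cutoff is the one suggested above and is likely to depend on the embedding model used.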

