Select similar sentences
If I have a set of sentences and want to extract the duplicates, I can work as in the following example:
sentences<-c("So there I was at the mercy of three monstrous trolls",
"Today is my One Hundred and Eleventh birthday",
"I'm sorry I brought this upon you, my",
"So there I was at the mercy of three monstrous trolls",
"Today is my One Hundred and Eleventh birthday",
"I'm sorry I brought this upon you, my")
sentences[duplicated(sentences)]
which returns:
[1] "So there I was at the mercy of three monstrous trolls"
[2] "Today is my One Hundred and Eleventh birthday"
[3] "I'm sorry I brought this upon you, my"
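In my case, however, some sentences are only similar to each other rather than identical (for example, because of typos), and I want to select the sentences that are most similar to one another. For example: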
sentences<-c("So there I was at the mercy of three monstrous trolls",
"Today is my One Hundred and Eleventh birthday",
"I'm sorry I brrrought this upon, my",
"So there I was at mercy of three monstrous troll",
"Today is One Hundred Eleventh birthday",
"I'm sorry I brought this upon you, my")
From this example, I would like to select one sentence from each of the following pairs:
I'm sorry I brought this upon you, my
I'm sorry I brrrought this upon, my
Today is One Hundred Eleventh birthday
Today is my One Hundred and Eleventh birthday
So there I was at the mercy of three monstrous trolls
So there I was at mercy of three monstrous troll
The levenshteinSim function from the RecordLinkage package can help me:
library(RecordLinkage)
levenshteinSim(sentences[1],sentences[2])
levenshteinSim(sentences[1],sentences[3])
levenshteinSim(sentences[1],sentences[4])
levenshteinSim(sentences[1],sentences[5])
levenshteinSim(sentences[1],sentences[6])
levenshteinSim(sentences[2],sentences[3])
levenshteinSim(sentences[2],sentences[4])
levenshteinSim(sentences[2],sentences[5])
levenshteinSim(sentences[2],sentences[6])
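and so on, where return values close to 1 indicate the most similar sentences. I could write a double for loop and select, for instance, the sentence pairs whose Levenshtein similarity is greater than 0.7. But isn't there a simpler way?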
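TLDR: You could probably use a bag-of-words (BoW) representation and convert these sentences into vectors. Then simply check the correlations, and drop any sentence that is too highly correlated with another one.
Bag of Words
Let's consider the following sentence:
- Jack is a handsome, handsome man
and assume that our whole universe of words is contained in that sentence. We can then simply create a vector of counts for the words occurring in this sentence (one entry per word), which gives a vector with 5 features (Jack, is, a, handsome, man).
The corresponding BoW representation is then: [1, 1, 1, 2, 1].
Another sentence in this universe could be:
- Jack Jack handsome handsome man
Again, we can use our 5-feature vector to represent this sentence: [2, 0, 0, 2, 1].
You can then compute the similarity of these two sentences in R: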
# Jack is a handsome, handsome man
first <- c(1,1,1,2,1)
# Jack Jack handsome handsome man
second <- c(2,0,0,2,1)
cor(first, second, method = "pearson")
#> [1] 0.559017
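The same idea can be applied to the sentences from the question. The following is only a minimal sketch in base R: the tokenisation (lower-casing, dropping punctuation other than apostrophes, splitting on whitespace) and the 0.7 cut-off are assumptions made for illustration, not part of the answer above.

```r
sentences <- c("So there I was at the mercy of three monstrous trolls",
               "Today is my One Hundred and Eleventh birthday",
               "I'm sorry I brrrought this upon, my",
               "So there I was at mercy of three monstrous troll",
               "Today is One Hundred Eleventh birthday",
               "I'm sorry I brought this upon you, my")

# Tokenise: lower-case, drop punctuation (keeping apostrophes), split on whitespace
tokens <- strsplit(gsub("[^a-z' ]", "", tolower(sentences)), "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# Bag-of-words matrix: one row per sentence, one column per vocabulary word
bow <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))

# Pearson correlation between every pair of sentences (rows of bow)
sim <- cor(t(bow))

# Keep each pair once (lower triangle only), then apply the assumed threshold
sim[upper.tri(sim, diag = TRUE)] <- NA
pairs <- which(sim > 0.7, arr.ind = TRUE)
data.frame(a = sentences[pairs[, "col"]],
           b = sentences[pairs[, "row"]],
           cor = sim[pairs])
```

Note that a word with a typo (such as "brrrought") counts as a different word, so this representation is less forgiving of misspellings than a character-level edit distance.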
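You can use adist to compute an approximate string distance matrix, which is based on the generalized Levenshtein distance, and then perform hierarchical clustering with hclust.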
ld <- adist(tolower(sentences))
hc <- hclust(as.dist(ld))
data.frame(x=sentences, cl=cutree(hc, h=10))
# x cl
# 1 So there I was at the mercy of three monstrous trolls 1
# 2 Today is my One Hundred and Eleventh birthday 2
# 3 I'm sorry I brrrought this upon, my 3
# 4 So there I was at mercy of three monstrous troll 1
# 5 Today is One Hundred Eleventh birthday 2
# 6 I'm sorry I brought this upon you, my 3
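Once each sentence has a cluster label, one sentence per cluster can be kept, which is what the question asks for. A minimal sketch, assuming that "the first sentence seen in each cluster" is an acceptable representative (the name `representatives` is introduced here for illustration):

```r
sentences <- c("So there I was at the mercy of three monstrous trolls",
               "Today is my One Hundred and Eleventh birthday",
               "I'm sorry I brrrought this upon, my",
               "So there I was at mercy of three monstrous troll",
               "Today is One Hundred Eleventh birthday",
               "I'm sorry I brought this upon you, my")

# Edit-distance matrix and hierarchical clustering, as above
ld <- adist(tolower(sentences))
hc <- hclust(as.dist(ld))
cl <- cutree(hc, h = 10)

# Keep only the first sentence encountered in each cluster
representatives <- sentences[!duplicated(cl)]
representatives
```

A different rule (for example the longest sentence of each cluster) could be substituted if the first occurrence is not the preferred one.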
To find a suitable value for h= in cutree, we can plot the dendrogram.
plot(hc)
abline(h=10, col=2, lty=2)
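If a similarity between 0 and 1 is preferred over raw edit distances, as in the question, the adist matrix can be rescaled without writing a double loop. The sketch below assumes the normalisation 1 - distance / (length of the longer string); I believe this matches what levenshteinSim computes, but treat that as an assumption. The 0.7 threshold is taken from the question.

```r
sentences <- c("So there I was at the mercy of three monstrous trolls",
               "Today is my One Hundred and Eleventh birthday",
               "I'm sorry I brrrought this upon, my",
               "So there I was at mercy of three monstrous troll",
               "Today is One Hundred Eleventh birthday",
               "I'm sorry I brought this upon you, my")

# Pairwise generalized Levenshtein distances
d <- adist(sentences)

# Length of the longer string for every pair
len     <- nchar(sentences)
max_len <- outer(len, len, pmax)

# Normalised similarity in [0, 1]; 1 means identical strings
sim <- 1 - d / max_len

# Keep each pair once (lower triangle only) and apply the threshold
sim[upper.tri(sim, diag = TRUE)] <- NA
idx <- which(sim > 0.7, arr.ind = TRUE)
data.frame(a = sentences[idx[, "col"]],
           b = sentences[idx[, "row"]],
           sim = round(sim[idx], 3))
```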
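You can generate an embedding for each sentence and then compute the cosine similarity between them.
The embeddings can be generated with a BERT-based model or with a GloVe model.
BERT: Sentence Transformers, which are aimed specifically at semantic similarity and paraphrase mining.
GloVe: tokenize the sentences, remove stop words, lemmatize to get the base words, generate the word embeddings and combine them into a single embedding per sentence, then compute the similarity score, again using cosine similarity.
Keeping the pairs with a similarity score above roughly 93 - 95% will give you the list of the most similar sentences.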
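The cosine-similarity step can be sketched in base R as follows. The function name `cosine_sim` and the small `emb` matrix are hypothetical: the numbers are made up for illustration and stand in for real sentence embeddings, which would come from whichever model you use (not shown here).

```r
# Cosine similarity between all rows of a numeric matrix
cosine_sim <- function(m) {
  unit <- m / sqrt(rowSums(m^2))  # scale each row to unit length
  unit %*% t(unit)                # dot products of unit vectors
}

# Made-up 3-dimensional "embeddings", one row per sentence (illustration only)
emb <- rbind(a = c(1, 2, 3),
             b = c(2, 4, 6.1),
             c = c(-3, 0, 1))

sim <- cosine_sim(emb)

# Pairs above an assumed 0.95 threshold, each pair listed once
sim[upper.tri(sim, diag = TRUE)] <- NA
which(sim > 0.95, arr.ind = TRUE)
```

Here rows a and b point in almost the same direction and are therefore flagged, while a and c are orthogonal and get a similarity of 0.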