Finding most similar sentences among all in python
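Suggestions, reference links, and code are appreciated.
I have a data set with more than 1,500 rows. Each row contains one sentence. I am trying to find the best way to identify the most similar sentences among all of them.
What I have tried
I tried the K-means algorithm, which groups similar sentences into clusters. One drawback is that I have to pass K to create the clusters, and K is hard to guess. I tried the elbow method to estimate the number of clusters, but clustering everything is not what I need: this approach assigns every row to some cluster, whereas I only want the rows that are more than 0.90 (90%) similar, returned together with their IDs.
I also tried cosine similarity, using TfidfVectorizer to create the matrix and then passing it to cosine similarity. Even this approach did not work properly.
What I am looking for
I want an approach where I can pass a threshold, for example 0.90, and every row that is more than 0.90 similar to another row is returned as the result.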
Data Sample
ID | DESCRIPTION
-----------------------------
10 | Cancel ASN WMS Cancel ASN
11 | MAXPREDO Validation is corect
12 | Move to QC
13 | Cancel ASN WMS Cancel ASN
14 | MAXPREDO Validation is right
15 | Verify files are sent every hours for this interface from Optima
16 | MAXPREDO Validation are correct
17 | Move to QC
18 | Verify files are not sent
Expected result
Rows that are more than 0.90 similar to each other should be returned with their IDs:
ID | DESCRIPTION
-----------------------------
10 | Cancel ASN WMS Cancel ASN
13 | Cancel ASN WMS Cancel ASN
11 | MAXPREDO Validation is corect # even spelling is not correct
14 | MAXPREDO Validation is right
16 | MAXPREDO Validation are correct
12 | Move to QC
17 | Move to QC
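One possible way is to use word embeddings to create vector representations of your sentences. For example, you can take pretrained word embeddings and let an RNN layer create a sentence vector representation, in which the word embeddings of each sentence are combined. You then have one vector per sentence and can compute the distances between them. You still need to decide which threshold makes two sentences count as similar, because the scale of word embeddings is not fixed.
Update
I did some experiments. In my opinion this is a viable approach for this kind of task, but you may want to find out for yourself how well it works in your case. I created an example in a git repository.
The word mover's distance algorithm can also be used for this task. You can find more information on this topic in this Medium article.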
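To make the embedding idea concrete, here is a minimal sketch. It is not the RNN approach from this answer: it simply averages word vectors, which is a simpler stand-in. The 3-dimensional vectors in `TOY_VECTORS` are invented for illustration only; in practice you would load real pretrained embeddings.

```python
import math

# Hypothetical 3-dimensional word vectors, invented purely for illustration.
# Real pretrained embeddings (word2vec, GloVe, fastText) have far more dimensions.
TOY_VECTORS = {
    "move": [0.9, 0.1, 0.0],
    "to": [0.1, 0.1, 0.1],
    "qc": [0.8, 0.2, 0.1],
    "validation": [0.0, 0.9, 0.2],
    "is": [0.1, 0.1, 0.2],
    "correct": [0.1, 0.8, 0.7],
    "right": [0.2, 0.7, 0.8],
}


def sentence_vector(sentence):
    """Average the vectors of all known words; unknown words are skipped."""
    vectors = [TOY_VECTORS[w] for w in sentence.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


same = cosine(sentence_vector("Validation is correct"),
              sentence_vector("Validation is right"))
different = cosine(sentence_vector("Validation is correct"),
                   sentence_vector("Move to QC"))
print(same > different)
```

Because "correct" and "right" were given nearby vectors, the two validation sentences score much closer to each other than to "Move to QC". A word that is missing from the vocabulary, such as the misspelled "corect", is simply skipped here, so typos would need separate handling.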
Why did it not work for you with cosine similarity and the TF-IDF vectorizer?
I tried it, and it works with the following code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample data from the question
df = pd.DataFrame(columns=["ID", "DESCRIPTION"], data=[
    [10, "Cancel ASN WMS Cancel ASN"],
    [11, "MAXPREDO Validation is corect"],
    [12, "Move to QC"],
    [13, "Cancel ASN WMS Cancel ASN"],
    [14, "MAXPREDO Validation is right"],
    [15, "Verify files are sent every hours for this interface from Optima"],
    [16, "MAXPREDO Validation are correct"],
    [17, "Move to QC"],
    [18, "Verify files are not sent"],
])

corpus = list(df["DESCRIPTION"].values)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

threshold = 0.4

# Compare every pair of rows once; y starts after x, so a row is never compared with itself
for x in range(X.shape[0]):
    for y in range(x + 1, X.shape[0]):
        similarity = cosine_similarity(X[x], X[y])  # 1x1 array
        if similarity > threshold:
            print(df["ID"][x], ":", corpus[x])
            print(df["ID"][y], ":", corpus[y])
            print("Cosine similarity:", similarity)
            print()
The threshold can be adjusted, but a threshold of 0.9 will not give you the result you want.
The output for a threshold of 0.4 is:
10 : Cancel ASN WMS Cancel ASN
13 : Cancel ASN WMS Cancel ASN
Cosine similarity: [[1.]]
11 : MAXPREDO Validation is corect
14 : MAXPREDO Validation is right
Cosine similarity: [[0.64183024]]
12 : Move to QC
17 : Move to QC
Cosine similarity: [[1.]]
15 : Verify files are sent every hours for this interface from Optima
18 : Verify files are not sent
Cosine similarity: [[0.44897995]]
With a threshold of 0.39, all expected sentences appear in the output, but the additional pair with the IDs [15, 18] is also found:
10 : Cancel ASN WMS Cancel ASN
13 : Cancel ASN WMS Cancel ASN
Cosine similarity: [[1.]]
11 : MAXPREDO Validation is corect
14 : MAXPREDO Validation is right
Cosine similarity: [[0.64183024]]
11 : MAXPREDO Validation is corect
16 : MAXPREDO Validation are correct
Cosine similarity: [[0.39895808]]
12 : Move to QC
17 : Move to QC
Cosine similarity: [[1.]]
14 : MAXPREDO Validation is right
16 : MAXPREDO Validation are correct
Cosine similarity: [[0.39895808]]
15 : Verify files are sent every hours for this interface from Optima
18 : Verify files are not sent
Cosine similarity: [[0.44897995]]
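The nested loop above calls cosine_similarity once per pair, which becomes slow with 1,500 rows. A possible alternative, sketched here under the assumption that a dense 1,500 x 1,500 matrix fits in memory, is to compute all similarities in one call and read the pairs above the threshold from the upper triangle. The function name `similar_pairs` is made up for this sketch.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def similar_pairs(ids, texts, threshold):
    """Return (id_a, id_b, score) for every pair whose TF-IDF cosine similarity exceeds threshold."""
    X = TfidfVectorizer().fit_transform(texts)
    sim = cosine_similarity(X)  # dense (n, n) matrix of all pairwise similarities
    # k=1 keeps only the entries above the diagonal, so each pair appears once
    rows, cols = np.where(np.triu(sim, k=1) > threshold)
    return [(ids[r], ids[c], float(sim[r, c])) for r, c in zip(rows, cols)]


ids = [10, 11, 12, 13, 14, 15, 16, 17, 18]
texts = [
    "Cancel ASN WMS Cancel ASN",
    "MAXPREDO Validation is corect",
    "Move to QC",
    "Cancel ASN WMS Cancel ASN",
    "MAXPREDO Validation is right",
    "Verify files are sent every hours for this interface from Optima",
    "MAXPREDO Validation are correct",
    "Move to QC",
    "Verify files are not sent",
]

for id_a, id_b, score in similar_pairs(ids, texts, threshold=0.9):
    print(id_a, id_b, round(score, 2))
```

With word-level TF-IDF and a threshold of 0.9, only the two pairs of identical sentences, (10, 13) and (12, 17), should be returned, which matches the scores listed above. To catch near-duplicates with typos such as "corect", it may be worth trying character n-grams, for example TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)), and tuning the threshold on your own data.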
You can use this Python 3 library to compute sentence similarity: https://github.com/UKPLab/sentence-transformers
Code example from https://www.sbert.net/docs/usage/semantic_textual_similarity.html:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-MiniLM-L12-v2')

# Two lists of sentences
sentences1 = ['The cat sits outside',
              'A man is playing guitar',
              'The new movie is awesome']

sentences2 = ['The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

# Compute embeddings for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

# Compute cosine similarities
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2)

# Output the pairs with their score
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))
The library contains state-of-the-art sentence embedding models.
See also: performing sentence clustering.
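Whichever similarity score you end up using, the question asks for groups of IDs rather than individual pairs. One way to get there, sketched below, is to treat every pair above the threshold as a link and merge linked IDs with a small union-find. The function name `group_pairs` and the example pairs are illustrative; the pairs were picked to match the expected result in the question, not produced by a model.

```python
def group_pairs(pairs):
    """Merge pairs of IDs into groups (connected components) using union-find."""
    parent = {}

    def find(item):
        # Register unseen items as their own root, then walk up to the root
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving keeps trees shallow
            item = parent[item]
        return item

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), []).append(item)
    return sorted(sorted(members) for members in groups.values())


# Illustrative pairs of IDs that scored above some chosen threshold
pairs = [(10, 13), (11, 14), (11, 16), (12, 17), (14, 16)]
print(group_pairs(pairs))
# prints [[10, 13], [11, 14, 16], [12, 17]]
```

Note that this grouping is transitive: if A is similar to B and B to C, all three end up in one group even when A and C score below the threshold, so a fairly strict threshold is advisable.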