Sentence similarity using Universal Sentence Encoder with a threshold
I have a dataset with more than 1500 rows, each containing a sentence. I am trying to find the best way to identify the most similar sentences. I tried this approach, but it is far too slow: about 20 minutes for 1500 rows.
I reused the code from my previous question and tried many variations to speed it up, but nothing helped much. Then I came across the universal sentence encoder using TensorFlow, which looks fast and quite accurate. I am working in Colab, which you can check here.
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" #@param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5", "https://tfhub.dev/google/universal-sentence-encoder-lite/2"]
model = hub.load(module_url)
print ("module %s loaded" % module_url)
def embed(input):
    return model(input)
df = pd.DataFrame(columns=["ID","DESCRIPTION"], data=np.matrix([[10,"Cancel ASN WMS Cancel ASN"],
[11,"MAXPREDO Validation is corect"],
[12,"Move to QC"],
[13,"Cancel ASN WMS Cancel ASN"],
[14,"MAXPREDO Validation is right"],
[15,"Verify files are sent every hours for this interface from Optima"],
[16,"MAXPREDO Validation are correct"],
[17,"Move to QC"],
[18,"Verify files are not sent"]
]))
messages = list(df['DESCRIPTION'])  # sentences to embed
message_embeddings = embed(messages)

for i, message_embedding in enumerate(np.array(message_embeddings).tolist()):
    print("Message: {}".format(messages[i]))
    print("Embedding size: {}".format(len(message_embedding)))
    message_embedding_snippet = ", ".join(
        (str(x) for x in message_embedding[:3]))
    print("Embedding: [{}, ...]\n".format(message_embedding_snippet))
What I am looking for
I want a way to pass a threshold, for example 0.90, and get back all rows whose similarity to each other is above that threshold.
Data Sample
ID | DESCRIPTION
-----------------------------
10 | Cancel ASN WMS Cancel ASN
11 | MAXPREDO Validation is corect
12 | Move to QC
13 | Cancel ASN WMS Cancel ASN
14 | MAXPREDO Validation is right
15 | Verify files are sent every hours for this interface from Optima
16 | MAXPREDO Validation are correct
17 | Move to QC
18 | Verify files are not sent
Expected result
Rows from the data above whose similarity is at or above 0.90 should be returned, together with their IDs:
ID | DESCRIPTION
-----------------------------
10 | Cancel ASN WMS Cancel ASN
13 | Cancel ASN WMS Cancel ASN
11 | MAXPREDO Validation is corect # even spelling is not correct
14 | MAXPREDO Validation is right
16 | MAXPREDO Validation are correct
12 | Move to QC
17 | Move to QC
You can measure the similarity between two embedding vectors in several ways.
The most common is cosine similarity.
So the first thing to do is compute the similarity matrix:
Code:
import sklearn.metrics.pairwise  # provides cosine_similarity

message_embeddings = embed(list(df['DESCRIPTION']))
cos_sim = sklearn.metrics.pairwise.cosine_similarity(message_embeddings)
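For intuition, cosine similarity is just the dot product of two vectors divided by the product of their norms. A minimal NumPy sketch (assuming the message_embeddings and cos_sim computed above) that reproduces a single entry of the matrix:
# Manually compute the cosine similarity between the first two sentences;
# this should match cos_sim[0][1] returned by scikit-learn.
emb = np.array(message_embeddings)
manual = np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))
print(manual, cos_sim[0][1])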
You get a 9*9 matrix of similarity values.
You can create a heatmap of this matrix to visualize it.
Code:
def plot_similarity(labels, corr_matrix):
    sns.set(font_scale=1.2)
    g = sns.heatmap(
        corr_matrix,
        xticklabels=labels,
        yticklabels=labels,
        vmin=0,
        vmax=1,
        cmap="YlOrRd")
    g.set_xticklabels(labels, rotation=90)
    g.set_title("Semantic Textual Similarity")

plot_similarity(list(df['DESCRIPTION']), cos_sim)
Output: a heatmap of the similarity matrix.
The darker a box, the more similar the two sentences are.
Finally, you iterate over this cos_sim matrix with a threshold to collect all the similar sentences:
threshold = 0.8
row_index = []
for i in range(cos_sim.shape[0]):
    if i in row_index:
        continue
    similar = [index for index in range(cos_sim.shape[1]) if (cos_sim[i][index] > threshold)]
    if len(similar) > 1:
        row_index += similar

sim_df = pd.DataFrame()
sim_df['ID'] = [df['ID'][i] for i in row_index]
sim_df['DESCRIPTION'] = [df['DESCRIPTION'][i] for i in row_index]
sim_df
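If you also want the result grouped the way the expected output in the question shows (each cluster of mutually similar rows listed together), a small variation of the same loop can keep the clusters separate. This is only a sketch built on the cos_sim matrix and threshold above; the GROUP column is a hypothetical addition, not part of the original answer:
# Collect clusters of mutually similar rows instead of one flat list.
groups = []        # list of lists of row indices
assigned = set()
for i in range(cos_sim.shape[0]):
    if i in assigned:
        continue
    similar = [j for j in range(cos_sim.shape[1]) if cos_sim[i][j] > threshold]
    if len(similar) > 1:
        groups.append(similar)
        assigned.update(similar)

# Hypothetical GROUP column identifying each cluster.
grouped_df = pd.DataFrame(
    [(g, df['ID'][j], df['DESCRIPTION'][j]) for g, group in enumerate(groups) for j in group],
    columns=['GROUP', 'ID', 'DESCRIPTION'])
grouped_df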
Output: the resulting dataframe lists the similar rows together with their IDs.
You can generate the similarity matrix using different methods.
You can check this for more approaches.
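As one example of a different scoring method (a sketch, not taken from the answer above): because the Universal Sentence Encoder embeddings are roughly unit-length, a plain inner product gives values very close to the cosine similarities computed earlier:
# Inner-product similarity matrix; for (near) unit-norm embeddings this is
# approximately equal to the cosine similarity matrix.
emb = np.array(message_embeddings)
inner_sim = np.inner(emb, emb)
print(inner_sim.shape)  # (9, 9) for the sample data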