How do I sort sentence-embedding similarity scores and return them with their respective input sentences?
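I successfully generated vectors for every sentence in my two corpora and computed the cosine similarity (normalized dot product) between every possible pair: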
import tensorflow_hub as hub
from sklearn.metrics.pairwise import cosine_similarity
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings1 = ["I'd like an apple juice",
"An apple a day keeps the doctor away",
"Eat apple every day",
"We buy apples every week",
"We use machine learning for text classification",
"Text classification is subfield of machine learning"]
embeddings1 = embed(embeddings1)
embeddings2 = ["I'd like an orange juice",
"An orange a day keeps the doctor away",
"Eat orange every day",
"We buy orange every week",
"We use machine learning for document classification",
"Text classification is some subfield of machine learning"]
embeddings2 = embed(embeddings2)
print(cosine_similarity(embeddings1, embeddings2))
array([[ 0.7882168 , 0.3366559 , 0.22973989, 0.15428472, -0.10180502,
-0.04344492],
[ 0.256085 , 0.7713026 , 0.32120776, 0.17834462, -0.10769081,
-0.09398925],
[ 0.23850328, 0.446203 , 0.62606746, 0.25242645, -0.03946173,
-0.00908459],
[ 0.24337521, 0.35571027, 0.32963073, 0.6373588 , 0.08571904,
-0.01240187],
[-0.07001016, -0.12002315, -0.02002328, 0.09045915, 0.9141338 ,
0.8373743 ],
[-0.04525191, -0.09421931, -0.00631144, -0.00199519, 0.75919366,
0.9686416 ]]
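To get meaningful output, I need to sort these scores and return them together with the corresponding input sentences. Does anyone know how to do this? I couldn't find any tutorial for this task.
You could use np.argsort(...) to do the sorting: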
import numpy as np
import tensorflow_hub as hub
from sklearn.metrics.pairwise import cosine_similarity
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
seq1 = ["I'd like an apple juice",
"An apple a day keeps the doctor away",
"Eat apple every day",
"We buy apples every week",
"We use machine learning for text classification",
"Text classification is subfield of machine learning"]
embeddings1 = embed(seq1)
seq2 = ["I'd like an orange juice",
"An orange a day keeps the doctor away",
"Eat orange every day",
"We buy orange every week",
"We use machine learning for document classification",
"Text classification is some subfield of machine learning"]
embeddings2 = embed(seq2)
a = cosine_similarity(embeddings1, embeddings2)
def get_pairs(a, b):
    # Build a (len(a), len(b), 2) array where c[i, j] == [a[i], b[j]]
    a = np.array(a)
    b = np.array(b)
    c = np.array(np.meshgrid(a, b))
    c = c.T.reshape(len(a), -1, 2)
    return c
pairs = get_pairs(seq1, seq2)
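# For each sentence in seq1 (each row), order the seq2 columns from most to least similar
sorted_idx = np.argsort(-a, axis=1)[..., None]
sorted_pairs = np.take_along_axis(pairs, sorted_idx, axis=1)
print(sorted_pairs[0, 0])
print(sorted_pairs[0, 1])
print(sorted_pairs[0, 2])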
["I'd like an apple juice" "I'd like an orange juice"]
["I'd like an apple juice" 'An orange a day keeps the doctor away']
["I'd like an apple juice" 'Eat orange every day']
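If a single ranking across all pairs is wanted rather than one ordering per row, the same argsort idea can be applied to the flattened matrix. The following is only a sketch: the similarity values are copied from the output in the question so that it runs without downloading the model, and `rank_all_pairs` and `best_match_per_row` are hypothetical helper names, not part of any library.

```python
import numpy as np

seq1 = ["I'd like an apple juice",
        "An apple a day keeps the doctor away",
        "Eat apple every day",
        "We buy apples every week",
        "We use machine learning for text classification",
        "Text classification is subfield of machine learning"]
seq2 = ["I'd like an orange juice",
        "An orange a day keeps the doctor away",
        "Eat orange every day",
        "We buy orange every week",
        "We use machine learning for document classification",
        "Text classification is some subfield of machine learning"]

# Similarity matrix copied from the question (rows: seq1, columns: seq2).
sim = np.array([
    [0.7882168, 0.3366559, 0.22973989, 0.15428472, -0.10180502, -0.04344492],
    [0.256085, 0.7713026, 0.32120776, 0.17834462, -0.10769081, -0.09398925],
    [0.23850328, 0.446203, 0.62606746, 0.25242645, -0.03946173, -0.00908459],
    [0.24337521, 0.35571027, 0.32963073, 0.6373588, 0.08571904, -0.01240187],
    [-0.07001016, -0.12002315, -0.02002328, 0.09045915, 0.9141338, 0.8373743],
    [-0.04525191, -0.09421931, -0.00631144, -0.00199519, 0.75919366, 0.9686416],
])


def rank_all_pairs(sim, left, right):
    """Return (left sentence, right sentence, score) for every pair, best first."""
    # axis=None sorts the flattened matrix; [::-1] makes the order descending.
    flat_order = np.argsort(sim, axis=None)[::-1]
    # Convert flat positions back into (row, column) indices.
    rows, cols = np.unravel_index(flat_order, sim.shape)
    return [(left[i], right[j], float(sim[i, j])) for i, j in zip(rows, cols)]


def best_match_per_row(sim, left, right):
    """Return, for each left sentence, its most similar right sentence."""
    best = np.argmax(sim, axis=1)
    return [(left[i], right[j], float(sim[i, j])) for i, j in enumerate(best)]


ranked = rank_all_pairs(sim, seq1, seq2)
for s1, s2, score in ranked[:3]:
    print(f"{score:.3f}  {s1!r}  <->  {s2!r}")
```

Returning plain tuples keeps each score attached to its two sentences, so the result can be sliced for a top-k list or filtered by a threshold without further index bookkeeping.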
I was passing a string instead of a list of strings. Problem solved.
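For anyone hitting the same mistake, a small guard can normalize the input before it is embedded or paired. This is a minimal sketch, assuming the problem was a bare `str` passed where a list of sentences was expected; `as_sentence_list` is a hypothetical helper name, not part of tensorflow_hub or NumPy.

```python
def as_sentence_list(sentences):
    """Wrap a lone string in a list; return any other iterable as a list."""
    if isinstance(sentences, str):
        return [sentences]
    return list(sentences)


single = "Eat apple every day"
# A bare string is a sequence of characters, not a sequence of sentences.
print(len(single))
# After wrapping, it is a list containing exactly one sentence.
print(len(as_sentence_list(single)))
```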