How do I create embeddings for every sentence in a list and not for the list as a whole?
I need to generate embeddings for the documents in a list, compute the cosine similarity between every sentence of corpus 1 and every sentence of corpus 2, sort the results, and output the best fit:
import tensorflow_hub as hub
from sklearn.metrics.pairwise import cosine_similarity

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

embeddings1 = ["I'd like an apple juice",
               "An apple a day keeps the doctor away",
               "Eat apple every day",
               "We buy apples every week",
               "We use machine learning for text classification",
               "Text classification is subfield of machine learning"]
embeddings1 = embed(embeddings1)

embeddings2 = ["I'd like an orange juice",
               "An orange a day keeps the doctor away",
               "Eat orange every day",
               "We buy orange every week",
               "We use machine learning for document classification",
               "Text classification is some subfield of machine learning"]
embeddings2 = embed(embeddings2)

print(cosine_similarity(embeddings1, embeddings2))
The vectors seem to work fine (judging by the shape of the array), and so does the cosine-similarity computation. My problem is that the Universal Sentence Encoder does not return the embeddings together with their corresponding strings, which matters: it always has to find the best match, and I have to be able to sort by the cosine-similarity values afterwards:
array([[ 0.7882168 , 0.3366559 , 0.22973989, 0.15428472, -0.10180502,
-0.04344492],
[ 0.256085 , 0.7713026 , 0.32120776, 0.17834462, -0.10769081,
-0.09398925],
[ 0.23850328, 0.446203 , 0.62606746, 0.25242645, -0.03946173,
-0.00908459],
[ 0.24337521, 0.35571027, 0.32963073, 0.6373588 , 0.08571904,
-0.01240187],
[-0.07001016, -0.12002315, -0.02002328, 0.09045915, 0.9141338 ,
0.8373743 ],
[-0.04525191, -0.09421931, -0.00631144, -0.00199519, 0.75919366,
0.9686416 ]])
The goal is for the code to figure out on its own that, for "I'd like an apple juice", the highest cosine similarity in the second corpus belongs to "I'd like an orange juice", and to match the two. I tried a loop, for example:
for sentence in embeddings1:
    print(sentence, embed(sentence))
which leads to this error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: input must be a vector, got shape: []
[[{{node StatefulPartitionedCall/StatefulPartitionedCall/text_preprocessor/tokenize/StringSplit/StringSplit}}]] [Op:__inference_restored_function_body_5285]
Function call stack:
restored_function_body
As I mentioned in the comments, you should write the for loop as follows:
for sentence in embeddings1:
    print(sentence, embed([sentence]))
The reason is simply that embed expects a list of strings as input. There is no more detailed explanation than that.
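As for matching and sorting, here is a minimal sketch of how you could pair each sentence with its best match using the similarity matrix: take the `argmax` of each row and sort the pairs by descending score. The pairing loop is my addition, and the matrix below is hard-coded from the output you printed; in practice you would compute it with `cosine_similarity` exactly as in your question.

```python
import numpy as np

sentences1 = ["I'd like an apple juice",
              "An apple a day keeps the doctor away",
              "Eat apple every day",
              "We buy apples every week",
              "We use machine learning for text classification",
              "Text classification is subfield of machine learning"]
sentences2 = ["I'd like an orange juice",
              "An orange a day keeps the doctor away",
              "Eat orange every day",
              "We buy orange every week",
              "We use machine learning for document classification",
              "Text classification is some subfield of machine learning"]

# Similarity matrix as printed by cosine_similarity(embeddings1, embeddings2)
sim = np.array([[ 0.7882168 ,  0.3366559 ,  0.22973989,  0.15428472, -0.10180502, -0.04344492],
                [ 0.256085  ,  0.7713026 ,  0.32120776,  0.17834462, -0.10769081, -0.09398925],
                [ 0.23850328,  0.446203  ,  0.62606746,  0.25242645, -0.03946173, -0.00908459],
                [ 0.24337521,  0.35571027,  0.32963073,  0.6373588 ,  0.08571904, -0.01240187],
                [-0.07001016, -0.12002315, -0.02002328,  0.09045915,  0.9141338 ,  0.8373743 ],
                [-0.04525191, -0.09421931, -0.00631144, -0.00199519,  0.75919366,  0.9686416 ]])

# Row i holds the similarities of sentences1[i] against every sentence in
# corpus 2, so the column index of the row maximum is the best match.
best = sim.argmax(axis=1)

# Pair each sentence with its best match, then sort pairs by descending score.
pairs = sorted(
    ((sentences1[i], sentences2[j], float(sim[i, j])) for i, j in enumerate(best)),
    key=lambda p: p[2],
    reverse=True,
)

for s1, s2, score in pairs:
    print(f"{score:.3f}  {s1!r}  ->  {s2!r}")
```

This keeps the original strings alongside their scores, so "I'd like an apple juice" is matched with "I'd like an orange juice" without losing track of which embedding belongs to which sentence.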