How do I create embeddings for every sentence in a list and not for the list as a whole?

Question

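I need to generate embeddings for the documents in two lists, compute the cosine similarity between every sentence of corpus 1 and every sentence of corpus 2, rank the results, and return the best match: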

import tensorflow_hub as hub
# Assumed source of cosine_similarity; the original post does not show its imports.
from sklearn.metrics.pairwise import cosine_similarity

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

embeddings1 = ["I'd like an apple juice",
               "An apple a day keeps the doctor away",
               "Eat apple every day",
               "We buy apples every week",
               "We use machine learning for text classification",
               "Text classification is subfield of machine learning"]
embeddings1 = embed(embeddings1)

embeddings2 = ["I'd like an orange juice",
               "An orange a day keeps the doctor away",
               "Eat orange every day",
               "We buy orange every week",
               "We use machine learning for document classification",
               "Text classification is some subfield of machine learning"]
embeddings2 = embed(embeddings2)

print(cosine_similarity(embeddings1, embeddings2))

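The vectors seem to work fine (judging by the shape of the array), and so does the cosine similarity calculation. My problem is that Universal Sentence Encoder does not return the embeddings together with their respective strings, which is essential. The code must always find the right match, and I must be able to sort by the cosine similarity value: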

array([[ 0.7882168 ,  0.3366559 ,  0.22973989,  0.15428472, -0.10180502, -0.04344492],
       [ 0.256085  ,  0.7713026 ,  0.32120776,  0.17834462, -0.10769081, -0.09398925],
       [ 0.23850328,  0.446203  ,  0.62606746,  0.25242645, -0.03946173, -0.00908459],
       [ 0.24337521,  0.35571027,  0.32963073,  0.6373588 ,  0.08571904, -0.01240187],
       [-0.07001016, -0.12002315, -0.02002328,  0.09045915,  0.9141338 ,  0.8373743 ],
       [-0.04525191, -0.09421931, -0.00631144, -0.00199519,  0.75919366,  0.9686416 ]]

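The goal is for the code to work out by itself that, for "I'd like an apple juice", the sentence in the second corpus with the highest cosine similarity is "I'd like an orange juice", and to match the two.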

I tried loops, for example:

for sentence in embeddings1:
    print(sentence, embed(sentence))

which results in this error:

tensorflow.python.framework.errors_impl.InvalidArgumentError:  input must be a vector, got shape: []
     [[{{node StatefulPartitionedCall/StatefulPartitionedCall/text_preprocessor/tokenize/StringSplit/StringSplit}}]] [Op:__inference_restored_function_body_5285]

Function call stack:
restored_function_body

Answer 1

As I mentioned in the comments, you should write the for loop as follows:

for sentence in embeddings1:
    print(sentence, embed([sentence]))

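The reason is simply that embed expects a list of strings as input. A bare string is a scalar with shape [], which is exactly what the error message complains about; wrapping it in a list turns it into a vector of length one. There is no more detailed explanation than that.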
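For the matching goal stated in the question, a per-sentence loop may not be needed at all: row i of the similarity matrix already corresponds to sentence i of the first list, and column j to sentence j of the second. The following is a minimal sketch of that idea, not a definitive implementation. The helper name `rank_matches` and the variables `corpus1`, `corpus2`, and `scores` are hypothetical, and the toy scores are the top-left 3x3 corner of the matrix printed in the question. It assumes the original strings are kept in their own variables instead of being overwritten by the output of `embed`.

```python
def rank_matches(sentences1, sentences2, similarity):
    """Pair each sentence in sentences1 with sentences2 ranked by similarity.

    similarity[i][j] is assumed to hold the cosine similarity between
    sentences1[i] and sentences2[j] (nested lists or a 2-D array both work).
    Returns a list of (sentence, [(candidate, score), ...]), best match first.
    """
    results = []
    for i, sentence in enumerate(sentences1):
        row = similarity[i]
        # Column indices sorted by descending similarity for this row.
        order = sorted(range(len(sentences2)), key=lambda j: row[j], reverse=True)
        results.append((sentence, [(sentences2[j], float(row[j])) for j in order]))
    return results


# Hypothetical toy data: strings stay in their own lists, scores come from
# the first three rows and columns of the matrix shown in the question.
corpus1 = ["I'd like an apple juice",
           "An apple a day keeps the doctor away",
           "Eat apple every day"]
corpus2 = ["I'd like an orange juice",
           "An orange a day keeps the doctor away",
           "Eat orange every day"]
scores = [[0.7882168, 0.3366559, 0.22973989],
          [0.256085, 0.7713026, 0.32120776],
          [0.23850328, 0.446203, 0.62606746]]

for sentence, candidates in rank_matches(corpus1, corpus2, scores):
    best, score = candidates[0]
    print(f"{sentence!r} -> {best!r} ({score:.3f})")
```

With the real model, `scores` would presumably be the result of `cosine_similarity(embed(corpus1), embed(corpus2))`, so the strings and their scores stay aligned by index.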


python

nlp

cosine-similarity

sentence-similarity

tensorflow