Spark word2vec example explanation and how to get similarity between strings

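I followed the word2vec example on the Spark documentation page (link). It works, but I don't quite understand what it is computing.

Is the output vector a representation of the input string?

If so, I tried computing the cosine similarity between the vectors, but I get negative values because the vector components are not all positive.

Can Spark word2vec produce only positive vectors?

How can I use Spark word2vec to compute the similarity between a list of strings?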

The output vector (produced by calling transform on a dataset) is a representation of the document (a sentence or several sentences) supplied to the model. In essence, it is a combination of the vector representations of the individual words in that document; in Spark ML's Word2VecModel this is the average of the word vectors rather than a plain sum.
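
To make the "combination of word vectors" idea concrete, here is a minimal plain-Java sketch that does not use Spark at all. The class `DocumentVectorSketch`, the method `averageVector`, and the in-memory `Map` of word vectors are all hypothetical names for illustration. Dividing by the total number of words in the document, and skipping out-of-vocabulary words, are assumptions of this sketch.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Spark: combines per-word vectors
// into a single document vector by averaging them component-wise.
class DocumentVectorSketch {

    static double[] averageVector(List<String> words,
                                  Map<String, double[]> wordVectors,
                                  int size) {
        double[] result = new double[size];
        for (String word : words) {
            double[] v = wordVectors.get(word);
            if (v == null) {
                continue; // assumption: unknown words contribute nothing
            }
            for (int i = 0; i < size; i++) {
                result[i] += v[i];
            }
        }
        // Assumption: divide by the number of words in the document.
        if (!words.isEmpty()) {
            for (int i = 0; i < size; i++) {
                result[i] /= words.size();
            }
        }
        return result;
    }
}
```

Because the word vectors themselves can have negative components, the averaged document vector can too.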

You can use findSynonyms to get the num words closest in similarity to a given word. findSynonyms is based purely on cosine similarity. Currently I am using word2vec to generate feature vectors that serve as input to another model.
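
The idea behind a findSynonyms-style lookup can be sketched in plain Java as ranking every other word by cosine similarity to the query word and keeping the top num. This is only an illustration of the technique, not Spark's implementation; `NearestWordsSketch`, `nearest`, and the in-memory vector map are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a cosine-based nearest-word lookup.
class NearestWordsSketch {

    // Cosine similarity; assumes both arrays have the same length.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // convention chosen here for zero vectors
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns up to `num` words, most similar first, excluding the query word.
    static List<String> nearest(String word, Map<String, double[]> vectors, int num) {
        List<String> candidates = new ArrayList<>();
        double[] query = vectors.get(word);
        if (query == null) {
            return candidates; // unknown query word: nothing to rank
        }
        for (String w : vectors.keySet()) {
            if (!w.equals(word)) {
                candidates.add(w);
            }
        }
        // Sort by descending cosine similarity to the query vector.
        candidates.sort(Comparator.comparingDouble(
                (String w) -> -cosine(query, vectors.get(w))));
        return new ArrayList<>(candidates.subList(0, Math.min(num, candidates.size())));
    }
}
```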

To compute the similarity between two strings as a single number, you would need to implement some variation of the findSynonyms method. The current implementation generates a cosVec corresponding to the input string and then finds the word vectors closest to that vector.
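
One possible variation, sketched in plain Java: take the two document vectors (for example, the arrays extracted from the transform output for each string) and compute their cosine similarity directly. The class and method names are hypothetical, and the sample vectors are made up for illustration. The optional `toUnitInterval` step is an assumption of this sketch for cases where a non-negative score is wanted; it linearly maps [-1, 1] onto [0, 1].

```java
// Hypothetical helper for scoring two document vectors against each other.
class StringSimilaritySketch {

    // Cosine similarity in [-1, 1]; assumes both arrays have the same length.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // convention chosen here for zero vectors
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Optional: rescale a cosine value from [-1, 1] to [0, 1].
    static double toUnitInterval(double cos) {
        return (cos + 1.0) / 2.0;
    }

    public static void main(String[] args) {
        // Made-up document vectors standing in for two transformed strings.
        double[] docA = {0.5, -0.2, 0.1};
        double[] docB = {-0.4, 0.3, 0.0};
        double cos = cosine(docA, docB);
        System.out.println("cosine   = " + cos);
        System.out.println("rescaled = " + toUnitInterval(cos));
    }
}
```

For a list of strings, the same function can be applied to each pair of document vectors. A negative result is not an error; it just means the two vectors point in broadly opposite directions.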

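I am not sure whether it can create only positive vectors, or whether generating only positive vectors is required or even makes sense. Cosine similarity ranges from -1 to 1, so negative values are expected whenever two vectors point in roughly opposite directions.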

java

cosine-similarity

apache-spark

word2vec