Why does Spark's Word2Vec return a vector?

Running Spark's example for Word2Vec, I realized that it takes in an array of strings and gives out a vector. My question is, shouldn't it return a matrix instead of a vector? I was expecting one vector per input word. But it returns one vector, period!

Or maybe it should have accepted a single string (one word) as input rather than an array of strings. Then, yes, it could return one vector as output. But accepting an array of strings and returning a single vector does not make sense to me.

[UPDATE]

As requested by @Shaido, I made a minor change to the code to print the schema of the output:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.Word2Vec;
import org.apache.spark.ml.feature.Word2VecModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.*;

public class JavaWord2VecExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("JavaWord2VecExample")
                .getOrCreate();

        // $example on$
        // Input data: Each row is a bag of words from a sentence or document.
        List<Row> data = Arrays.asList(
                RowFactory.create(Arrays.asList("Hi I heard about Spark".split(" "))),
                RowFactory.create(Arrays.asList("I wish Java could use case classes".split(" "))),
                RowFactory.create(Arrays.asList("Logistic regression models are neat".split(" ")))
        );
        StructType schema = new StructType(new StructField[]{
                new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
        });
        Dataset<Row> documentDF = spark.createDataFrame(data, schema);

        // Learn a mapping from words to Vectors.
        Word2Vec word2Vec = new Word2Vec()
                .setInputCol("text")
                .setOutputCol("result")
                .setVectorSize(7)
                .setMinCount(0);

        Word2VecModel model = word2Vec.fit(documentDF);
        Dataset<Row> result = model.transform(documentDF);

        for (Row row : result.collectAsList()) {
            List<String> text = row.getList(0);
            System.out.println("Schema: " + row.schema());
            Vector vector = (Vector) row.get(1);
            System.out.println("Text: " + text + " => \nVector: " + vector + "\n");
        }
        // $example off$

        spark.stop();
    }
}

which prints:

Schema: StructType(StructField(text,ArrayType(StringType,true),false), StructField(result,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
Text: [Hi, I, heard, about, Spark] => 
Vector: [-0.0033279924420639875,-0.0024428479373455048,0.01406305879354477,0.030621735751628878,0.00792500376701355,0.02839711122214794,-0.02286271695047617]

Schema: StructType(StructField(text,ArrayType(StringType,true),false), StructField(result,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
Text: [I, wish, Java, could, use, case, classes] => 
Vector: [-9.96453288410391E-4,-0.013741840076233658,0.013064394239336252,-0.01155538750546319,-0.010510949650779366,0.004538436819400106,-0.0036846946126648356]

Schema: StructType(StructField(text,ArrayType(StringType,true),false), StructField(result,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
Text: [Logistic, regression, models, are, neat] => 
Vector: [0.012510885251685977,-0.014472834207117558,0.002779599279165268,0.0022389178164303304,0.012743516173213721,-0.02409198731184006,0.017409833287820222]

Correct me if I am wrong, but the input is an array of strings and the output is a single vector. I was expecting each word to be mapped to a vector of its own.

To see the vector corresponding to each word, you can run model.getVectors. For the dataframe in the question (with a vector size of 3 instead of 7), this gives:

+----------+-----------------------------------------------------------------+
|word      |vector                                                           |
+----------+-----------------------------------------------------------------+
|heard     |[0.14950960874557495,-0.11237259954214096,-0.03993036597967148]  |
|are       |[-0.16390761733055115,-0.14509087800979614,0.11349033564329147]  |
|neat      |[0.13949351012706757,0.08127426356077194,0.15970033407211304]    |
|classes   |[0.03703496977686882,0.05841822177171707,-0.02267565205693245]   |
|I         |[-0.018915412947535515,-0.13099457323551178,0.14300788938999176] |
|regression|[0.1529865264892578,0.060659825801849365,0.07735282927751541]    |
|Logistic  |[-0.12702016532421112,0.09839040040969849,-0.10370948910713196]  |
|Spark     |[-0.053579315543174744,0.14673036336898804,-0.002033260650932789]|
|could     |[0.12216471135616302,-0.031169598922133446,-0.1427609771490097]  |
|use       |[0.08246973901987076,0.002503493567928672,-0.0796264186501503]   |
|Hi        |[0.16548289358615875,0.06477408856153488,0.09229831397533417]    |
|models    |[-0.05683165416121483,0.009706663899123669,-0.033789146691560745]|
|case      |[0.11626788973808289,0.10363516956567764,-0.07028932124376297]   |
|about     |[-0.1500445008277893,-0.049380943179130554,0.03307584300637245]  |
|Java      |[-0.04074851796030998,0.02809843420982361,-0.16281810402870178]  |
|wish      |[0.11882393807172775,0.13347993791103363,0.14399205148220062]    |
+----------+-----------------------------------------------------------------+

So each word has a representation of its own. What happens, however, when you feed a sentence (an array of strings) into the model is that the vectors of all the words in the sentence get averaged together.
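This per-row averaging can be sketched in plain Java with a toy word-vector map (the values below are made up for illustration; the real vectors come from the trained Word2VecModel):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class AverageWordVectors {
    // Average the word vectors of a sentence into one vector,
    // mirroring what Word2VecModel.transform does for each row.
    static double[] sentenceVector(List<String> words, Map<String, double[]> wordVectors, int size) {
        double[] avg = new double[size];
        for (String w : words) {
            double[] v = wordVectors.get(w);
            if (v == null) continue; // unknown word contributes nothing, but still counts in the denominator
            for (int i = 0; i < size; i++) avg[i] += v[i];
        }
        for (int i = 0; i < size; i++) avg[i] /= words.size();
        return avg;
    }

    public static void main(String[] args) {
        // Hypothetical word vectors of size 3
        Map<String, double[]> vecs = Map.of(
                "models", new double[]{1.0, 0.0, 2.0},
                "are",    new double[]{0.0, 1.0, 0.0},
                "neat",   new double[]{2.0, 2.0, 1.0});
        double[] s = sentenceVector(Arrays.asList("models", "are", "neat"), vecs, 3);
        System.out.println(Arrays.toString(s)); // [1.0, 1.0, 1.0]
    }
}
```

Note that `sentenceVector` here is a made-up helper name, not a Spark API; it only demonstrates the element-wise averaging.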

From the github implementation:

/**
 * Transform a sentence column to a vector column to represent the whole sentence. The transform
 * is performed by averaging all word vectors it contains.
 */
@Since("2.0.0")
override def transform(dataset: Dataset[_]): DataFrame = {
...

This is easy to confirm. For example:

Text: [Logistic, regression, models, are, neat] => 
Vector: [-0.011055880039930344,0.020988055132329465,0.042608972638845444]

The first element is computed as the average of the first elements of the vectors of the five words involved,

(-0.12702016532421112 + 0.1529865264892578 - 0.05683165416121483 - 0.16390761733055115 + 0.13949351012706757) / 5

which equals -0.011055880039930344.
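The arithmetic is easy to reproduce in plain Java, using the first elements of the five word vectors from the getVectors table above:

```java
public class CheckAverage {
    public static void main(String[] args) {
        // First vector elements for: Logistic, regression, models, are, neat
        double[] firstElements = {
                -0.12702016532421112,  // Logistic
                 0.1529865264892578,   // regression
                -0.05683165416121483,  // models
                -0.16390761733055115,  // are
                 0.13949351012706757   // neat
        };
        double sum = 0.0;
        for (double x : firstElements) sum += x;
        double avg = sum / firstElements.length;
        System.out.println(avg); // ≈ -0.011055880039930344, the first element of the sentence vector
    }
}
```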

This is an attempt to justify Spark's rationale here, and it should be read as a complement to the nice programming explanation already provided as an answer...

To start with, how individual word embeddings should be combined is not, in principle, a feature of the Word2Vec model itself (which is about single words), but a concern of "higher order" models, such as Sentence2Vec, Paragraph2Vec, Doc2Vec, Wikipedia2Vec etc. (you could name a few more, I guess...).

Having said that, it turns out that, for obtaining vector representations of larger pieces of text (phrases, sentences, tweets etc.), a very first approach to combining the word vectors is indeed to simply average the vector representations of the constituent words, exactly as Spark ML does.

Starting from the practitioner community, we have:

(SO answer):

There are at least three common ways to combine embedding vectors; (a) summing, (b) summing & averaging or (c) concatenating. [...] See gensim.models.doc2vec.Doc2Vec, dm_concat and dm_mean - it allows you to use any of those three options

Sentence2Vec : Evaluation of popular theories — Part I (Simple average of word vectors) (blog post):

So what’s first thing that comes to your mind when you have word vectors and need to calculate sentence vector.

Just average them?

Yes that’s what we are going to do here.

Sentence2Vec (Github repo):

Word2Vec can help to find other words with similar semantic meaning. However, Word2Vec can only take 1 word each time, while a sentence consists of multiple words. To solve this, I write the Sentence2Vec, which is actually a wrapper to Word2Vec. To obtain the vector of a sentence, I simply get the averaged vector sum of each word in the sentence.

Arguably, at least for practitioners, this simple averaging of the individual word vectors comes as no surprise.

An expected counter-argument here is that blog posts and SO answers are arguably not that reliable as sources; what about the researchers and the relevant scientific literature? Well, it turns out that this simple averaging is far from uncommon here, too:

From Distributed Representations of Sentences and Documents (Le & Mikolov, Google, ICML 2014):

From NILC-USP at SemEval-2017 Task 4: A Multi-view Ensemble for Twitter Sentiment Analysis (SemEval 2017, section 2.1.2):


It should be clear by now that this particular design choice in Spark ML is far from arbitrary, or even uncommon; I have blogged about seemingly absurd design choices in Spark ML (see Classification in Spark 2.0: "Input validation failed" and other wondrous tales), but, as it turns out, this is not such a case...