How to interpret results of Spark OneHotEncoder

Question

I read the OHE entry in the Spark documentation:

One-hot encoding maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.

Unfortunately, the documentation does not fully explain the OHE result. Running the code it gives:

from pyspark.ml.feature import OneHotEncoder, StringIndexer

df = sqlContext.createDataFrame([
    (0, "a"),
    (1, "b"),
    (2, "c"),
    (3, "a"),
    (4, "a"),
    (5, "c")
], ["id", "category"])

stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)

encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.transform(indexed)
encoded.show()

produces this result:

   +---+--------+-------------+-------------+
   | id|category|categoryIndex|  categoryVec|
   +---+--------+-------------+-------------+
   |  0|       a|          0.0|(2,[0],[1.0])|
   |  1|       b|          2.0|    (2,[],[])|
   |  2|       c|          1.0|(2,[1],[1.0])|
   |  3|       a|          0.0|(2,[0],[1.0])|
   |  4|       a|          0.0|(2,[0],[1.0])|
   |  5|       c|          1.0|(2,[1],[1.0])|
   +---+--------+-------------+-------------+

How do I interpret the OHE result (the last column)?

Answer 1

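One-hot encoding transforms each value in categoryIndex into a binary vector in which at most one entry is 1. Since there are three distinct values, the vectors have length 2, and the mapping is as follows: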

0  -> 10
1  -> 01
2  -> 00

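(Why is the mapping like this? See the related question on why the one-hot encoder drops the last category; this behavior is controlled by the encoder's dropLast parameter, which defaults to true.)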
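The mapping above can be reproduced with a small plain-Python sketch. The function name one_hot and its drop_last argument are hypothetical, chosen only to mirror the behavior described here; this is not Spark's implementation.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a one-hot list for `index`; with drop_last the final
    category is represented by an all-zero vector (hypothetical helper)."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0] * size
    if index < size:
        vec[index] = 1
    return vec

# Three category indices (0, 1, 2), as in the categoryIndex column
mapping = {i: one_hot(i, 3) for i in range(3)}
for i, vec in mapping.items():
    print(i, "->", "".join(str(bit) for bit in vec))
```

With drop_last turned off, the same function would return vectors of length 3, one position per category.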

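The values in the categoryVec column are exactly these vectors, but shown in sparse format. In this format the zero entries of a vector are not printed. The first value (2) is the length of the vector. The second value is an array listing the zero or more indices at which non-zero entries are found. The third value is another array giving the numbers found at those indices. So (2,[0],[1.0]) means a vector of length 2 with 1.0 at position 0 and 0 everywhere else, and (2,[],[]) is a vector of length 2 with no non-zero entries, that is, all zeros.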
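To make the sparse notation concrete, here is a minimal plain-Python sketch that expands a (size, indices, values) triple into a dense list. The helper name sparse_to_dense is hypothetical and is used for illustration only; the triples are the three distinct categoryVec values from the table in the question.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a dense list of floats."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# The distinct categoryVec values, keyed by the category they encode
rows = {
    "a": (2, [0], [1.0]),
    "c": (2, [1], [1.0]),
    "b": (2, [], []),
}
dense_rows = {cat: sparse_to_dense(*triple) for cat, triple in rows.items()}
for cat, vec in dense_rows.items():
    print(cat, vec)
```

In PySpark itself, calling toArray() on a SparseVector should give the equivalent dense array without needing a helper like this.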

See: https://spark.apache.org/docs/latest/mllib-data-types.html#local-vector
