String indexes converted to onehot vector are blank (no index set to 1) for some rows?

Question

I have a pyspark dataframe with a categorical column that I am converting into a onehot encoded vector via...

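from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer
# Map each distinct LABEL string to a numeric index
si = StringIndexer(inputCol="LABEL", outputCol="LABEL_IDX").fit(df)
df = si.transform(df)
# One-hot encode the index column
oh = OneHotEncoderEstimator(inputCols=["LABEL_IDX"], outputCols=["LABEL_OH"]).fit(df)
df = oh.transform(df)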

When I look at the dataframe afterwards, I see that some of the onehot encoded vectors look like...

(1,[],[])

I would expect the sparse vectors to look like (1,[0],[1.0]) or (1,[1],[1.0]), but here the vectors are just zeros.

Any idea what could be happening here?

Answer 1

This has to do with how values are encoded in mllib. 1hot does not encode a binary value like...

[1, 0] or [0, 1]

in the sense of [this, that], but rather as

[1] or [0]

In sparse vector format, the [0] case looks like (1,[],[]): length = 1, no position index holds a nonzero value, and (therefore) there are no nonzero values to list (the mllib docs have more on how sparse vectors are represented). So, just as a binary category needs only a single bit to represent both choices, the 1hot encoding uses a single index in the vector. From another article...

One Hot Encoding is very popular. We can represent all categories by N-1 columns (N = number of categories), as that is sufficient to encode the one that is not included [... But note that] for classification the recommendation is to use all N columns, as most tree-based algorithms build a tree based on all available
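
To make the N-1 idea concrete, here is a minimal pure-Python sketch of the encoding rule described above. It is not Spark's implementation; the function name `one_hot_sparse` and its `drop_last` argument are made up for illustration, and the returned tuple only mimics the (size, indices, values) layout of the sparse vectors shown in the question.

```python
def one_hot_sparse(index, num_categories, drop_last=True):
    """Return (size, indices, values) for a category index.

    Hypothetical helper for illustration only; it mimics the sparse
    notation used in the question, not Spark's actual code.
    """
    # With drop_last, the vector has one slot fewer than there are categories.
    size = num_categories - 1 if drop_last else num_categories
    if index < size:
        return (size, [index], [1.0])
    # The dropped (last) category is encoded as the all-zero vector.
    return (size, [], [])


# Two categories with the last one dropped: vectors have length 1.
print(one_hot_sparse(0, 2))                   # (1, [0], [1.0])
print(one_hot_sparse(1, 2))                   # (1, [], [])
# Keeping all N columns gives every category its own index.
print(one_hot_sparse(1, 2, drop_last=False))  # (2, [1], [1.0])
```

Under this rule, the empty vector in the question is simply the encoding of the last category index, not a missing value.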

If you do not want the onehot encoder to drop the last category to simplify the representation, the mllib class has a dropLast parameter you can set; see https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoderEstimator


apache-spark-mllib