String indexes converted to one-hot vectors are blank (no index set to 1) for some rows?

I have a pyspark dataframe with a categorical column that I am converting into a one-hot encoded vector via...

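from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator

# Map each distinct LABEL string to a numeric index
si = StringIndexer(inputCol="LABEL", outputCol="LABEL_IDX").fit(df)
df = si.transform(df)
# One-hot encode the numeric index into a sparse vector column
oh = OneHotEncoderEstimator(inputCols=["LABEL_IDX"], outputCols=["LABEL_OH"]).fit(df)
df = oh.transform(df)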

Looking at the dataframe afterwards, I see that some of the one-hot encoded vectors look like...

(1,[],[])

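I would expect the sparse vectors to look like either (1,[0],[1.0]) or (1,[1],[1.0]), but here the vectors are just zeros.

Any idea what could be happening here?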

This has to do with how the values are encoded in mllib. The 1hot encoder is not encoding the binary value like...

[1, 0] or [0, 1]

in a [this, that] kind of way, but rather as

[1] or [0]

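In the sparse vector format, the [0] case looks like (1,[],[]), meaning length = 1, no position index holds a nonzero value, and (thus) there are no nonzero values to list (the linked reference has more on how mllib represents sparse vectors). So, just as a binary category needs only a single bit to represent both choices, the 1hot encoding uses a single index in the vector. From another article...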
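To make the (size, indices, values) notation concrete, here is a minimal plain-Python sketch that expands such a triple into a dense list. The helper name sparse_to_dense is made up for illustration and is not part of mllib; it only mirrors how the printed form can be read.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list.

    Hypothetical helper for illustration only; not a Spark API.
    """
    dense = [0.0] * size              # every position starts at zero
    for i, v in zip(indices, values):
        dense[i] = v                  # fill in only the listed nonzero entries
    return dense


print(sparse_to_dense(1, [0], [1.0]))  # the "[1]" case
print(sparse_to_dense(1, [], []))      # the "[0]" case: nothing listed, so all zeros
```

Read this way, (1,[],[]) is a valid encoding of one category rather than a missing value.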

One Hot Encoding is very popular . We can represent all category by N-1 (N= No of Category) as that is sufficient to encode the one that is not included [... But note that] for classification recommendation is to use all N columns without as most of the tree based algorithm builds tree based on all available

If you don't want the one-hot encoder to drop the last category to simplify the representation, the mllib class has a dropLast param that you can set; see https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoderEstimator
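As a rough illustration of what dropLast changes, the following plain-Python sketch mimics the encoding for a given category index. The function one_hot_sparse and its drop_last argument are hypothetical stand-ins, not Spark code; with the real estimator the equivalent would be passing dropLast=False to OneHotEncoderEstimator.

```python
def one_hot_sparse(index, num_categories, drop_last=True):
    """Return a (size, indices, values) triple for a category index.

    Illustrative sketch only; it imitates the described behaviour
    and is not how Spark implements the encoder.
    """
    if not 0 <= index < num_categories:
        raise ValueError("index out of range")
    # Dropping the last category shortens the vector by one slot
    size = num_categories - 1 if drop_last else num_categories
    if index < size:
        return (size, [index], [1.0])
    # The dropped (last) category is represented by an all-zero vector
    return (size, [], [])


# Two categories, last one dropped: the second category has no nonzero entry
print(one_hot_sparse(0, 2), one_hot_sparse(1, 2))
# Two categories, nothing dropped: each category gets its own index
print(one_hot_sparse(0, 2, drop_last=False), one_hot_sparse(1, 2, drop_last=False))
```

With all categories kept, the vectors for a two-valued column have length 2 and every row has exactly one index set.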