String indexes converted to onehot vector are blank (no index set to 1) for some rows?
I have a pyspark dataframe with a categorical column that I'm converting into a one-hot encoded vector via...
si = StringIndexer(inputCol="LABEL", outputCol="LABEL_IDX").fit(df)
df = si.transform(df)
oh = OneHotEncoderEstimator(inputCols=["LABEL_IDX"], outputCols=["LABEL_OH"]).fit(df)
df = oh.transform(df)
When I look at the dataframe afterwards, I see that some of the one-hot encoded vectors look like...
(1,[],[])
I expected the sparse vectors to look like (1,[0],[1.0]) or (1,[1],[1.0]), but here the vector is just all zeros.
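For reference, here is a minimal sketch that reproduces what I am seeing (the two-value "LABEL" data is made up, just to make the behaviour concrete):

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator

spark = SparkSession.builder.getOrCreate()

# made-up two-category column
df = spark.createDataFrame([("a",), ("b",), ("a",)], ["LABEL"])

si = StringIndexer(inputCol="LABEL", outputCol="LABEL_IDX").fit(df)
df = si.transform(df)
oh = OneHotEncoderEstimator(inputCols=["LABEL_IDX"], outputCols=["LABEL_OH"]).fit(df)
df = oh.transform(df)

df.select("LABEL", "LABEL_IDX", "LABEL_OH").show(truncate=False)
# rows whose LABEL_IDX is 1.0 come out with LABEL_OH = (1,[],[])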
Any idea what is going on here?
This has to do with how the values are encoded in mllib. The one-hot encoder does not encode a binary category as...
[1, 0] or [0, 1]
in a [this, that] fashion, but rather as...
[1] or [0]
In the sparse vector format, the [0] case looks like (1,[],[]), meaning length = 1, no position index holds a non-zero value, and (therefore) there are no non-zero values to list (you can read more about how mllib represents sparse vectors here; there is also a short sketch of the notation after the quote below). So, just as a binary category only needs a single bit to represent both choices, the one-hot encoding uses a single index in the vector. From another article...
One Hot Encoding is very popular. We can represent all categories by N-1 (N = No of Categories) as that is sufficient to encode the one that is not included [... But note that] for classification the recommendation is to use all N columns, as most of the tree based algorithms build the tree based on all available...
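To make the sparse notation concrete, here is a quick sketch using pyspark.ml.linalg.Vectors (the vector type the ML encoder outputs):

from pyspark.ml.linalg import Vectors

# length-1 vector with its single slot set to 1.0 (category index 0)
print(Vectors.sparse(1, [0], [1.0]))  # (1,[0],[1.0])

# length-1 vector with no non-zero entries (the dropped last category)
print(Vectors.sparse(1, [], []))      # (1,[],[])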
If you don't want the one-hot encoder to drop the last category in order to simplify the representation, the mllib class exposes a dropLast parameter you can set; see https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoderEstimator
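For example (a sketch with made-up two-value data; OneHotEncoderEstimator is the Spark 2.3/2.4 class), dropLast=False keeps an explicit slot for every category, so every row gets exactly one index set to 1.0:

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("a",)], ["LABEL"])  # made-up data

si = StringIndexer(inputCol="LABEL", outputCol="LABEL_IDX").fit(df)
df = si.transform(df)

# dropLast=False keeps a slot for every category instead of dropping the last one
oh = OneHotEncoderEstimator(inputCols=["LABEL_IDX"], outputCols=["LABEL_OH"],
                            dropLast=False).fit(df)
oh.transform(df).select("LABEL", "LABEL_OH").show(truncate=False)
# "a" -> (2,[0],[1.0]), "b" -> (2,[1],[1.0])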