Spark MLlib: Including categorical features
What is the correct (or best) way to include categorical variables (both strings and integers) as features for MLlib algorithms?
Is it correct to apply OneHotEncoder to the categorical variables and then include the output columns, together with the other columns, in a VectorAssembler, as in the code below?
I ask because I end up with a DataFrame containing rows like the ones below, and it looks as though feature3 and feature4, taken together, carry the same 'level' of importance as each of the two categorical features does on its own.
+------------------+-----------------------+---------------------------+
|prediction |actualVal |features |
+------------------+-----------------------+---------------------------+
|355416.44924898935|990000.0 |(17,[0,1,2,3,4,5,10,15],[1.0,206.0]) |
|358917.32988024893|210000.0 |(17,[0,1,2,3,4,5,10,15,16],[1.0,172.0]) |
|291313.84175674635|4600000.0 |(17,[0,1,2,3,4,5,12,15,16],[1.0,239.0]) |
Here is my code:
import org.apache.spark.sql.types.DoubleType
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.RandomForestRegressor

// Index the string categorical column, then one-hot encode it.
val indexer = new StringIndexer()
  .setInputCol("stringFeatureCode")
  .setOutputCol("stringFeatureCodeIndex")
  .fit(data)
val indexed = indexer.transform(data)

val encoder = new OneHotEncoder()
  .setInputCol("stringFeatureCodeIndex")
  .setOutputCol("stringFeatureCodeVec")
var encoded = encoder.transform(indexed)

// Cast the integer categorical column to Double so it can be one-hot encoded directly.
encoded = encoded.withColumn("intFeatureCodeTmp", encoded.col("intFeatureCode").cast(DoubleType))
  .drop("intFeatureCode")
  .withColumnRenamed("intFeatureCodeTmp", "intFeatureCode")

val intFeatureCodeEncoder = new OneHotEncoder()
  .setInputCol("intFeatureCode")
  .setOutputCol("intFeatureCodeVec")
encoded = intFeatureCodeEncoder.transform(encoded)

// Assemble the encoded categoricals and the numeric columns into a single feature vector.
val assemblerDeparture = new VectorAssembler()
  .setInputCols(Array("stringFeatureCodeVec", "intFeatureCodeVec", "feature3", "feature4"))
  .setOutputCol("features")
var data2 = assemblerDeparture.transform(encoded)

val Array(trainingData, testData) = data2.randomSplit(Array(0.7, 0.3))

val rf = new RandomForestRegressor()
  .setLabelCol("actualVal")
  .setFeaturesCol("features")
  .setNumTrees(100)
- Generally speaking, yes, this is the recommended approach.
- When working with tree models, however, it is unnecessary and should be avoided. You can simply use StringIndexer on its own; see the sketch after this list.
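A minimal sketch of the StringIndexer-only approach for a tree model might look like the following. The column names are taken from the question; the Pipeline wrapping and the setMaxBins value are illustrative assumptions, not part of the original code:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.RandomForestRegressor

// Index both categorical columns; no one-hot encoding is needed for tree models,
// because StringIndexer attaches nominal metadata that the trees can split on directly.
val stringIndexer = new StringIndexer()
  .setInputCol("stringFeatureCode")
  .setOutputCol("stringFeatureCodeIndex")
val intIndexer = new StringIndexer()
  .setInputCol("intFeatureCode")
  .setOutputCol("intFeatureCodeIndex")

// Assemble the indexed categoricals and the numeric columns into one feature vector;
// VectorAssembler propagates the nominal metadata to the vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("stringFeatureCodeIndex", "intFeatureCodeIndex", "feature3", "feature4"))
  .setOutputCol("features")

val rf = new RandomForestRegressor()
  .setLabelCol("actualVal")
  .setFeaturesCol("features")
  .setNumTrees(100)
  .setMaxBins(64) // assumption: must be at least the number of distinct categories in any categorical column

val pipeline = new Pipeline().setStages(Array(stringIndexer, intIndexer, assembler, rf))
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
val model = pipeline.fit(trainingData)
val predictions = model.transform(testData)

Whether routing intFeatureCode through StringIndexer (rather than keeping it as a raw numeric feature) is appropriate depends on whether its values really are category codes rather than quantities.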