SparkML - Creating a df(feature, feature_importance) of a RandomForestRegressionModel
I am training a random forest model as follows:
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
//Indexer
val stringIndexers = categoricalColumns.map { colName =>
new StringIndexer()
.setInputCol(colName)
.setOutputCol(colName + "Idx")
.setHandleInvalid("keep")
.fit(training)
}
//HotEncoder
val encoders = featuresEnconding.map { colName =>
new OneHotEncoderEstimator()
.setInputCols(Array(colName + "Idx"))
.setOutputCols(Array(colName + "Enc"))
.setHandleInvalid("keep")
}
//Adding features into a feature vector column
val assembler = new VectorAssembler()
.setInputCols(featureColumns)
.setOutputCol("features")
val rf = new RandomForestRegressor()
.setLabelCol("label")
.setFeaturesCol("features")
val stepsRF = stringIndexers ++ encoders ++ Array(assembler, rf)
val pipelineRF = new Pipeline()
.setStages(stepsRF)
val paramGridRF = new ParamGridBuilder()
.addGrid(rf.maxBins, Array(800))
.addGrid(rf.featureSubsetStrategy, Array("all"))
.addGrid(rf.minInfoGain, Array(0.05))
.addGrid(rf.minInstancesPerNode, Array(1))
.addGrid(rf.maxDepth, Array(28,29,30))
.addGrid(rf.numTrees, Array(20))
.build()
//Defining the evaluator
val evaluatorRF = new RegressionEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
//Using cross validation to train the model
//Start with TrainSplit -Cross Validations taking so long so far
val cvRF = new CrossValidator()
.setEstimator(pipelineRF)
.setEvaluator(evaluatorRF)
.setEstimatorParamMaps(paramGridRF)
.setNumFolds(10)
.setParallelism(3)
//Fitting the model with our training dataset
val cvRFModel = cvRF.fit(training)
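What I want now is to get the importance of each feature in the model after training.
I was able to get the importance of each feature as an Array[Double] like this: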
val bestModel = cvRFModel.bestModel.asInstanceOf[PipelineModel]
val size = bestModel.stages.size-1
val ftrImp = bestModel.stages(size).asInstanceOf[RandomForestRegressionModel].featureImportances.toArray
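However, this gives me only the importance of each feature together with a numeric index; I do not know which feature name corresponds to each importance value in my model.
I should also mention that, because I am using a one-hot encoder, the final number of features is much larger than the original featureColumns array.
How can I extract the feature names that were used during model training?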
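To see why the importance array ends up longer than featureColumns, it may help to model how an assembled vector is laid out. The sketch below is a simplified, Spark-free illustration and not Spark's actual code: the `FeatureSlots` object, the `Scalar` and `Encoded` types, the column names, the widths, and the slot labels are all hypothetical. It assumes the input columns keep their order, a numeric column takes one slot, and an encoded column takes one slot per vector element, so index i of the importance array refers to slot i.

```scala
// Simplified model of assembled feature slots; all names and widths are hypothetical.
object FeatureSlots {
  sealed trait InputCol
  final case class Scalar(name: String) extends InputCol               // numeric column: 1 slot
  final case class Encoded(name: String, width: Int) extends InputCol  // one-hot vector: `width` slots

  // Expand the input columns, in order, into one label per slot of the assembled vector.
  def slotNames(cols: Seq[InputCol]): Vector[String] =
    cols.toVector.flatMap {
      case Scalar(name)         => Vector(name)
      case Encoded(name, width) => Vector.tabulate(width)(i => s"${name}_$i")
    }

  // Three input columns expand to five slots, so five importances would be reported.
  val example: Vector[String] =
    slotNames(Seq(Scalar("age"), Encoded("cityEnc", 3), Scalar("income")))
}
```

Under these assumptions, a numeric index alone is not enough to recover a feature name; the per-slot labels have to come from somewhere that records the expansion.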
I found this possible solution:
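import org.apache.spark.ml.attribute._
import org.apache.spark.sql.functions.desc
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark
val bestModel = cvRFModel.bestModel.asInstanceOf[PipelineModel]
val lstModel = bestModel.stages.last.asInstanceOf[RandomForestRegressionModel]
// Assumption: predictions is a DataFrame produced by the fitted pipeline, e.g. bestModel.transform(...)
val schema = predictions.schema
// The features column carries per-slot attribute metadata, including the slot names
val featureAttrs = AttributeGroup.fromStructField(schema(lstModel.getFeaturesCol)).attributes.get
val mfeatures = featureAttrs.map(_.name.get)
val mdf = sc.parallelize(mfeatures zip ftrImp).toDF("featureName","Importance")
.orderBy(desc("Importance"))
display(mdf)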
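One thing to watch in the snippet above is that `zip` stops at the shorter of its two inputs, so a mismatch between the number of names and the number of importances would go unnoticed. A minimal sketch of a pure-Scala helper that checks the lengths before pairing and sorting is shown here; `FeatureRanking` and `rank` are hypothetical names, not part of Spark.

```scala
// Pure-Scala helper; `FeatureRanking` and `rank` are hypothetical names.
object FeatureRanking {
  // Pair each name with its importance and sort by importance, highest first.
  // Fails fast on a length mismatch instead of letting zip truncate silently.
  def rank(names: Seq[String], importances: Seq[Double]): Seq[(String, Double)] = {
    require(
      names.length == importances.length,
      s"got ${names.length} names but ${importances.length} importances"
    )
    names.zip(importances).sortBy { case (_, importance) => -importance }
  }
}
```

With the values from the snippet above, this could presumably be called as `FeatureRanking.rank(mfeatures.toSeq, ftrImp.toSeq)`, and the resulting sequence of pairs turned into a DataFrame with `toDF("featureName", "Importance")` as before.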