How to use RandomForest in Spark Pipeline

I want to tune my model with grid search and cross validation in Spark. In Spark, the base model has to be placed inside a Pipeline; the official Pipeline demo uses LogisticRegression as the base model, which can be instantiated with new as an object. However, the RandomForest model cannot be created with new from client code, so it seems impossible to use RandomForest in the Pipeline API. I don't want to reinvent the wheel, so can anybody give some advice? Thanks.

However, the RandomForest model cannot be created with new from client code, so it seems it is not possible to use RandomForest in the pipeline API.

Well, that is true, but you are simply trying to use the wrong class. Instead of mllib.tree.RandomForest you should use ml.classification.RandomForestClassifier. Here is an example based on the one from the MLlib docs.

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.util.MLUtils
import sqlContext.implicits._ // assumes a SQLContext in scope, e.g. in spark-shell

case class Record(category: String, features: Vector)

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainData, testData) = (splits(0), splits(1))

val trainDF = trainData.map(lp => Record(lp.label.toString, lp.features)).toDF
val testDF = testData.map(lp => Record(lp.label.toString, lp.features)).toDF

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("label")

val rf = new RandomForestClassifier()
  .setNumTrees(3)
  .setFeatureSubsetStrategy("auto")
  .setImpurity("gini")
  .setMaxDepth(4)
  .setMaxBins(32)

val pipeline = new Pipeline()
  .setStages(Array(indexer, rf))

val model = pipeline.fit(trainDF)

model.transform(testDF)
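
Since the original question was about tuning with grid search and cross validation, the Pipeline above can be plugged into a CrossValidator directly. This is only a sketch: the grid values are illustrative, and MulticlassClassificationEvaluator requires Spark 1.5+ (on earlier versions you would use BinaryClassificationEvaluator for this two-class dataset).

import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Illustrative grid over some RandomForestClassifier hyperparameters
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.maxDepth, Array(3, 5, 8))
  .addGrid(rf.numTrees, Array(10, 20))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)                // the whole Pipeline is the estimator
  .setEvaluator(new MulticlassClassificationEvaluator().setLabelCol("label"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(trainDF)
cvModel.transform(testDF)

cvModel.bestModel then holds the Pipeline fitted with the best parameter combination.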

There is one thing I couldn't figure out here. As far as I can tell, it should be possible to use labels extracted from LabeledPoints directly, but for some reason it doesn't work and pipeline.fit raises an IllegalArgumentException:

RandomForestClassifier was given input with invalid label column label, without the number of classes specified.

Hence the StringIndexer trick. After applying it we get the required attributes ({"vals":["1.0","0.0"],"type":"nominal","name":"label"}), but some classes in ml seem to work just fine without them.
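
For completeness, those attributes can also be attached to a numeric label column by hand with ml.attribute.NominalAttribute, as an alternative to going through StringIndexer. This is a sketch; Column.as(alias, metadata) is available from Spark 1.5 onwards:

import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.functions.col

// Build nominal metadata declaring the two classes of the label column
val labelMeta = NominalAttribute.defaultAttr
  .withName("label")
  .withValues("0.0", "1.0")
  .toMetadata()

// Reuse the labels from the LabeledPoints directly, with metadata attached
val trainWithMeta = trainData.toDF
  .select(col("features"), col("label").as("label", labelMeta))

// RandomForestClassifier can now read the number of classes from the metadata
rf.fit(trainWithMeta)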