How to use RandomForest in Spark Pipeline
I want to tune my model with grid search and cross-validation in Spark. In Spark, the base model has to be placed inside a Pipeline; the official Pipeline demo uses LogisticRegression as the base model, which can be instantiated with new. However, the RandomForest model cannot be created with new by client code, so it seems impossible to use RandomForest in the Pipeline API. I don't want to reinvent the wheel, so can anyone offer some advice?
Thanks
However, the RandomForest model cannot be created with new by client code, so it seems it is not possible to use RandomForest in the Pipeline API.
Well, that's true, but you are simply trying to use the wrong class. You should use ml.classification.RandomForestClassifier instead of mllib.tree.RandomForest. Here is an example based on the one from the MLlib docs:
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.util.MLUtils
import sqlContext.implicits._

case class Record(category: String, features: Vector)

// Load the sample LibSVM data and split it into training and test sets
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainData, testData) = (splits(0), splits(1))

// Convert RDD[LabeledPoint] to DataFrames with a string category column
val trainDF = trainData.map(lp => Record(lp.label.toString, lp.features)).toDF
val testDF = testData.map(lp => Record(lp.label.toString, lp.features)).toDF

// Index the string labels so the classifier receives the metadata it needs
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("label")

val rf = new RandomForestClassifier()
  .setNumTrees(3)
  .setFeatureSubsetStrategy("auto")
  .setImpurity("gini")
  .setMaxDepth(4)
  .setMaxBins(32)

val pipeline = new Pipeline()
  .setStages(Array(indexer, rf))

val model = pipeline.fit(trainDF)
model.transform(testDF)
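Since the original goal was grid search with cross-validation, the pipeline above can be plugged into a CrossValidator directly. This is a sketch assuming Spark 1.5+ (for MulticlassClassificationEvaluator); the grid values are arbitrary examples:

```scala
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Grid over a couple of RandomForestClassifier hyperparameters
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(10, 50))
  .addGrid(rf.maxDepth, Array(4, 8))
  .build()

// Cross-validate the whole pipeline, not just the classifier,
// so the StringIndexer is refit on each training fold
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new MulticlassClassificationEvaluator().setLabelCol("label"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(trainDF)
cvModel.transform(testDF)
```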
There is one thing I can't figure out here. As far as I can tell, it should be possible to use labels extracted directly from LabeledPoints, but for some reason it doesn't work, and pipeline.fit throws an IllegalArgumentException:

RandomForestClassifier was given input with invalid label column label, without the number of classes specified.

Hence the StringIndexer trick. After applying it we obtain the required attributes ({"vals":["1.0","0.0"],"type":"nominal","name":"label"}), but some classes in ml seem to work just fine without it.
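An alternative to the StringIndexer detour is to attach the nominal-attribute metadata to the label column by hand. This is a sketch, assuming the labels are already the doubles 0.0/1.0 as in this dataset, and that a DataFrame like trainDF above has a string "category" column:

```scala
import org.apache.spark.ml.attribute.NominalAttribute

// Build the same metadata StringIndexer would have produced
val labelMeta = NominalAttribute.defaultAttr
  .withName("label")
  .withValues("0.0", "1.0")
  .toMetadata

// Cast the category back to double and attach the metadata via Column.as
val trainWithMeta = trainDF.withColumn(
  "label", trainDF("category").cast("double").as("label", labelMeta))
```

With the metadata in place, RandomForestClassifier can read the number of classes from the column itself, so the indexer stage is no longer needed.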