Random Forest Analysis
I have a Spark (1.5.2) DataFrame and a trained RandomForestClassificationModel. I can fit the data and make predictions easily enough, but I want to dig deeper into which edge values are the most common participants in each binary classification scenario.
In the past I did something similar with an RDD, computing the predictions myself in order to track feature usage. In the code below I track the list of features used to compute each prediction; the DataFrame does not seem as straightforward as the RDD in this respect.
// Walks an (mllib) decision tree by hand so the split features along the path can be recorded.
def predict(node: Node, features: Vector, path_in: Array[Int]): (Double, Double, Array[Int]) = {
  if (node.isLeaf) {
    (node.predict.predict, node.predict.prob, path_in)
  } else {
    // track our path through the tree
    val path = path_in :+ node.split.get.feature
    if (node.split.get.featureType == FeatureType.Continuous) {
      if (features(node.split.get.feature) <= node.split.get.threshold) {
        predict(node.leftNode.get, features, path)
      } else {
        predict(node.rightNode.get, features, path)
      }
    } else {
      if (node.split.get.categories.contains(features(node.split.get.feature))) {
        predict(node.leftNode.get, features, path)
      } else {
        predict(node.rightNode.get, features, path)
      }
    }
  }
}
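Once each prediction carries its path of split-feature indices, counting which features participate most often is a plain collection fold. A minimal sketch (the `paths` data below is invented for illustration; in practice it would come from the predictions above):

```scala
// Hypothetical output of the RDD-style predict above: one path of
// split-feature indices per prediction. Grouping and counting shows
// which features are consulted most often across predictions.
val paths: Seq[Array[Int]] = Seq(Array(0, 3, 7), Array(0, 7), Array(3, 7, 7))

val featureCounts: Map[Int, Int] =
  paths.flatten.groupBy(identity).map { case (f, hits) => f -> hits.size }

// Feature 7 appears four times across the three paths.
println(featureCounts(7)) // 4
```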
I want to do something similar to this code, except that instead of the feature list per feature vector I would return the list of all feature/edge value pairs. Note that in my dataset all the features are categorical, and the bins were set appropriately when the forest was built.
I ended up building a custom udf to do this:
// Base prediction method. Accepts a Random Forest model and a feature vector.
// Returns the averaged prediction plus, per tree: the prediction, the impurity,
// and the (feature index, feature value) pair used on the final edge.
def predicForest(m: RandomForestClassificationModel, point: Vector): (Double, Array[(Double, Double, (Int, Double))]) = {
  val results = m.trees.map(t => predict(t.rootNode, point))
  (results.map(_._1).sum / results.length, results)
}
def predict(node: Node, features: Vector): (Double, Double, (Int, Double)) = {
  node match {
    case internalNode: InternalNode =>
      // track our path through the tree
      internalNode.split match {
        case split: CategoricalSplit =>
          val featureValue = features(split.featureIndex)
          if (split.leftCategories.contains(featureValue)) {
            internalNode.leftChild match {
              case _: LeafNode => (node.prediction, node.impurity, (split.featureIndex, featureValue))
              case child       => predict(child, features)
            }
          } else {
            internalNode.rightChild match {
              case _: LeafNode => (node.prediction, node.impurity, (split.featureIndex, featureValue))
              case child       => predict(child, features)
            }
          }
        case _ =>
          // If we run into an unimplemented split type we just return
          (node.prediction, node.impurity, (-1, -1.0))
      }
    case _ =>
      // If we run into an unimplemented node type we just return
      (node.prediction, node.impurity, (-1, -1.0))
  }
}
val rfModel = yourInstanceOfRandomForestClassificationModel

// This custom UDF executes the Random Forest classification in a trackable way
def treeAnalyzer(m: RandomForestClassificationModel) = udf((x: Vector) => predicForest(m, x))

// Executing the UDF runs the Random Forest classification on each row and stores
// the per-tree results in a new column named `prediction`
val df3 = testData.withColumn("prediction", treeAnalyzer(rfModel)(testData("indexedFeatures")))
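With the per-tree results in hand, the original goal, finding which feature/edge value pairs participate most often, reduces to counting the `(featureIndex, featureValue)` pairs across rows and trees. A minimal sketch over plain collections (the sample pairs are invented; in practice they would be collected from the `prediction` column):

```scala
// Invented sample of final-edge (featureIndex, featureValue) pairs, as
// returned per tree by predicForest. Group, count, and sort descending
// to rank the most common feature/edge value pairs.
val edges: Seq[(Int, Double)] = Seq((2, 1.0), (2, 1.0), (5, 0.0), (2, 1.0), (5, 3.0))

val ranked: Seq[((Int, Double), Int)] =
  edges.groupBy(identity)
    .map { case (pair, hits) => pair -> hits.size }
    .toSeq
    .sortBy { case (_, n) => -n }

// Feature 2 with categorical value 1.0 is the most common edge (3 hits).
println(ranked.head) // ((2,1.0),3)
```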