How is rawPrediction calculated in the PySpark RandomForest implementation?

Question

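I have trained an RF model (3 trees, depth 4) on a training set of 15 examples. Below is an image of what the three trees look like. I have two classes (say 0 and 1).

The thresholds are shown on the left branches, and the numbers in the circles (e.g. 7 and 3) are the numbers of examples whose feature 2 (f2) is <= the threshold and > the threshold, respectively.

Now, when I apply the model to a test set of 10 examples, I am not sure how the raw prediction is calculated.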

|prediction|features|rawPrediction|probability|
|1.0|[0.07707524933080619,0.03383458646616541,0.017208413001912046,9.0,2.5768015000258258,0.0,-1.0,-1.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0014143059186559938,0.0,0.6666666666666667,7.076533785087878E-4,0.0014163090128755495,0.9354143466934853,0.9333333333333333,0.875,0.938888892531395,7.0]|[1.0,2.0]|[0.3333333333333333,0.6666666666666666]|

I have gone through the following links, but I still cannot work this out.

https://forums.databricks.com/questions/14355/how-does-randomforestclassifier-compute-the-rawpre.html

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala

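I am fairly sure it is not as simple as one might think. For example, it is not a plain vote count: if two trees predict 0 and one tree predicts 1, the raw prediction is not simply [2, 1]. I know this because when I train the model on 500 examples, the raw prediction for the same example is [0.9552544653780279,2.0447455346219723].

Can someone explain how this is calculated mathematically? Any help would be appreciated; this is fairly basic, and I want to understand exactly how it works. Thanks again, and if any other information is needed to help solve this, please post.

Edit: also adding the data from the model: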

+------+-----------------------------------------------------------------------------------------------------+
|treeID|nodeData                                                                                             |
+------+-----------------------------------------------------------------------------------------------------+
|0     |[0, 0.0, 0.5, [9.0, 9.0], 0.19230769230769235, 1, 4, [2, [0.12519961673586713], -1]]                 |
|0     |[1, 0.0, 0.42603550295857984, [9.0, 4.0], 0.42603550295857984, 2, 3, [20, [0.39610389610389607], -1]]|
|0     |[2, 0.0, 0.0, [9.0, 0.0], -1.0, -1, -1, [-1, [], -1]]                                                |
|0     |[3, 1.0, 0.0, [0.0, 4.0], -1.0, -1, -1, [-1, [], -1]]                                                |
|0     |[4, 1.0, 0.0, [0.0, 5.0], -1.0, -1, -1, [-1, [], -1]]                                                |
|1     |[0, 1.0, 0.4444444444444444, [5.0, 10.0], 0.4444444444444444, 1, 2, [4, [0.9789660448762616], -1]]   |
|1     |[1, 1.0, 0.0, [0.0, 10.0], -1.0, -1, -1, [-1, [], -1]]                                               |
|1     |[2, 0.0, 0.0, [5.0, 0.0], -1.0, -1, -1, [-1, [], -1]]                                                |
|2     |[0, 0.0, 0.48, [3.0, 2.0], 0.48, 1, 2, [20, [0.3246753246753247], -1]]                               |
|2     |[1, 0.0, 0.0, [3.0, 0.0], -1.0, -1, -1, [-1, [], -1]]                                                |
|2     |[2, 1.0, 0.0, [0.0, 2.0], -1.0, -1, -1, [-1, [], -1]]                                                |
+------+-----------------------------------------------------------------------------------------------------+

Answer 1

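The raw prediction is the predicted class probability from each tree, summed over all trees in the forest. For a single tree, the class probabilities are determined by the number of training samples of each class in the leaf node that the input reaches.

In the code, this procedure can be seen in the RandomForestClassifier class; the relevant code is quoted here: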

override protected def predictRaw(features: Vector): Vector = {
  // TODO: When we add a generic Bagging class, handle transform there: SPARK-7128
  // Classifies using majority votes.
  // Ignore the tree weights since all are 1.0 for now.
  val votes = Array.fill[Double](numClasses)(0.0)
  _trees.view.foreach { tree =>
    val classCounts: Array[Double] = tree.rootNode.predictImpl(features).impurityStats.stats
    val total = classCounts.sum
    if (total != 0) {
      var i = 0
      while (i < numClasses) {
        votes(i) += classCounts(i) / total
        i += 1
      }
    }
  }
  Vectors.dense(votes)
}

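For each tree, we find the leaf node that corresponds to the input features and look up the count for each class (the class counts are the numbers of training samples assigned to that leaf node during training). Dividing the class counts by the node's total count gives the class probabilities.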

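Now, for each tree, we have the probability that the input features belong to each class. These probabilities are summed over the trees to give the raw prediction.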
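The summation can be sketched in plain Python (not Spark). This is only an illustration of the logic in the quoted predictRaw: the function name `raw_prediction` is made up here, and the impure leaf counts in the second call are invented for the example.

```python
def raw_prediction(leaf_counts_per_tree, num_classes):
    """Sum per-tree class probabilities, following the quoted predictRaw."""
    votes = [0.0] * num_classes
    for counts in leaf_counts_per_tree:
        total = sum(counts)
        if total != 0:
            for i in range(num_classes):
                # Each tree contributes a probability, not a hard 0/1 vote.
                votes[i] += counts[i] / total
    return votes


# Pure leaves: every tree contributes exactly 0 or 1 per class.
pure = raw_prediction([[0.0, 4.0], [5.0, 0.0], [0.0, 2.0]], 2)

# Hypothetical impure leaves (counts invented for illustration).
impure = raw_prediction([[1.0, 3.0], [2.0, 6.0], [0.0, 5.0]], 2)

print(pure)    # [1.0, 2.0]
print(impure)  # [0.5, 2.5]
```

With impure leaves the entries become fractional, which is consistent with a value such as [0.9552544653780279,2.0447455346219723] on the larger training set. In both cases the entries add up to the number of trees (3).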

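More specific to this question: as I understand it, the main missing element is the class counts (and therefore the class probabilities). These are essential for computing the raw prediction. In the image, instead of "Pred 0" and "Pred 1", you would need the number of training samples assigned to each leaf during training ("Pred 0" only means that samples from class 0 are the majority, and vice versa). Once you know the class counts, and hence the class probabilities, summing the class probabilities over all trees gives you the raw prediction.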
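As a worked check, the nodeData posted in the question can be walked by hand for the test row shown there. The sketch below is plain Python, not Spark code, and rests on two assumptions: that each nodeData row is laid out as (id, prediction, impurity, impurityStats, gain, leftChild, rightChild, split) with split = (featureIndex, thresholds, numCategories), and that a row goes to the left child when its feature value is <= the threshold, as the question describes. The names `TREES`, `FEATURES`, and `leaf_counts` are made up for this example.

```python
# Nodes copied from the question's nodeData table, keyed by tree id and node id.
# Each entry: (class_counts, left_child, right_child, split_feature, threshold).
# The field meanings are an assumption about the nodeData layout (see above).
TREES = {
    0: {
        0: ([9.0, 9.0], 1, 4, 2, 0.12519961673586713),
        1: ([9.0, 4.0], 2, 3, 20, 0.39610389610389607),
        2: ([9.0, 0.0], -1, -1, -1, None),
        3: ([0.0, 4.0], -1, -1, -1, None),
        4: ([0.0, 5.0], -1, -1, -1, None),
    },
    1: {
        0: ([5.0, 10.0], 1, 2, 4, 0.9789660448762616),
        1: ([0.0, 10.0], -1, -1, -1, None),
        2: ([5.0, 0.0], -1, -1, -1, None),
    },
    2: {
        0: ([3.0, 2.0], 1, 2, 20, 0.3246753246753247),
        1: ([3.0, 0.0], -1, -1, -1, None),
        2: ([0.0, 2.0], -1, -1, -1, None),
    },
}

# Only the features used by the splits are needed (0-based indices 2, 4 and 20
# of the test row in the question).
FEATURES = {2: 0.017208413001912046, 4: 2.5768015000258258, 20: 0.9333333333333333}


def leaf_counts(nodes, features):
    """Walk from the root (node 0) to a leaf and return its class counts."""
    node_id = 0
    while True:
        counts, left, right, feature, threshold = nodes[node_id]
        if left == -1:  # a leaf has no children
            return counts
        node_id = left if features[feature] <= threshold else right


raw = [0.0, 0.0]
for nodes in TREES.values():
    counts = leaf_counts(nodes, FEATURES)
    total = sum(counts)
    for i in range(2):
        raw[i] += counts[i] / total

# Dividing by the sum reproduces the probability column shown in the question.
probability = [v / sum(raw) for v in raw]

print(raw)          # [1.0, 2.0]
print(probability)  # [0.3333333333333333, 0.6666666666666666]
```

Under these assumptions the row ends in leaves with counts [0.0, 4.0], [5.0, 0.0] and [0.0, 2.0], so the per-tree probabilities are [0, 1], [1, 0] and [0, 1], which sum to the rawPrediction [1.0,2.0] reported in the question.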
