Get the default number of elements per leaf in a Decision Tree of Spark MLlib

If possible, I would like to get the default number of elements per leaf in a Spark MLlib decision tree.

I have been reading https://spark.apache.org/docs/latest/mllib-decision-tree.html and also trying to find something in https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/Node.scala, but I could not find the information I need.

I know about the minInstancesPerNode strategy parameter, but that is not what I am looking for.

Any ideas? Thanks!

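Spark's DecisionTreeClassifier has several parameters that you can set with the setXYZ methods before training. Many of them help you regularize the tree and avoid overfitting. For example: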

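setMinInstancesPerNode: the minimum number of training records a node/leaf must contain to be valid. A split that would produce a node/leaf with fewer than minInstancesPerNode records is discarded, so those records stay in the parent.
setMaxDepth: the maximum depth at which the tree stops growing.
setMinInfoGain: the minimum information gain required for a split to happen.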
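
As a minimal sketch of the setters above, the snippet below prints the values a freshly constructed DecisionTreeClassifier reports through its getters, then builds a second estimator with explicit settings. The object name and the chosen values (10, 4, 0.01) are only illustrative assumptions, not recommendations.

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier

// Hypothetical object name, used only for this illustration.
object TreeParamsSketch {
  def main(args: Array[String]): Unit = {
    // Constructing the estimator does not require a running SparkSession.
    val defaults = new DecisionTreeClassifier()
    println(s"minInstancesPerNode = ${defaults.getMinInstancesPerNode}")
    println(s"maxDepth            = ${defaults.getMaxDepth}")
    println(s"minInfoGain         = ${defaults.getMinInfoGain}")

    // Illustrative (made-up) settings for a more strongly regularized tree.
    val regularized = new DecisionTreeClassifier()
      .setMinInstancesPerNode(10)
      .setMaxDepth(4)
      .setMinInfoGain(0.01)

    // explainParams() lists every parameter with its doc string and current value.
    println(regularized.explainParams())
  }
}
```

Printing the getters on an unconfigured estimator is a quick way to check the defaults for the Spark version you are actually running.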

After training (.fit) a Spark decision tree and then predicting (.transform), your DataFrame will have 3 additional columns (for classification):

predictionCol: "Predicted label"
rawPredictionCol: "Vector of length # classes, with the counts of training instance labels at the tree node which makes the prediction"
probabilityCol: "Vector of length # classes equal to rawPrediction normalized to a multinomial distribution"
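
To make the relationship between these three columns concrete, here is a minimal plain-Scala sketch (no Spark needed). The leaf counts are made up, the object and function names are hypothetical, and it assumes no thresholds are set on the classifier, so the predicted label is simply the class with the largest count.

```scala
// Hypothetical helper, not part of Spark: mimics how the three output
// columns relate to each other for a single leaf.
object RawPredictionSketch {
  // probability: per-class counts normalized so that they sum to 1.
  def probability(rawPrediction: Array[Double]): Array[Double] = {
    val total = rawPrediction.sum
    if (total == 0.0) rawPrediction.map(_ => 0.0)
    else rawPrediction.map(_ / total)
  }

  // prediction: index of the class with the highest count.
  def prediction(rawPrediction: Array[Double]): Int =
    rawPrediction.indices.maxBy(i => rawPrediction(i))

  def main(args: Array[String]): Unit = {
    // Made-up leaf: 3 training instances of class 0 and 1 of class 1.
    val raw = Array(3.0, 1.0)
    println(s"elements in leaf = ${raw.sum}")
    println(s"probability      = ${probability(raw).mkString("[", ", ", "]")}")
    println(s"prediction       = ${prediction(raw)}")
  }
}
```

Summing the entries of the raw vector gives the total number of training instances that ended up in that leaf.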

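The rawPredictionCol column may be what you are looking for. It tells you how many training instances of each class ended up in the leaf once the tree was built. The predicted label is the class with the highest count. probabilityCol (derived from rawPredictionCol) captures the "confidence" of the prediction. See: https://spark.apache.org/docs/latest/ml-classification-regression.html#output-columns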
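
A rough end-to-end sketch of reading those per-leaf counts is shown below. The tiny dataset, the object name, and the local master setting are assumptions for illustration only; the column names are the defaults used by DecisionTreeClassifier.

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical object name, used only for this illustration.
object LeafCountsSketch {
  // Fits a tree on a tiny made-up dataset and returns the prediction columns.
  def predictions(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Made-up data: (label, features) with a single numeric feature.
    val training = Seq(
      (0.0, Vectors.dense(0.0)),
      (0.0, Vectors.dense(0.0)),
      (0.0, Vectors.dense(0.0)),
      (1.0, Vectors.dense(1.0)),
      (1.0, Vectors.dense(1.0))
    ).toDF("label", "features")

    val model = new DecisionTreeClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .fit(training)

    // Textual dump of the learned splits and leaf predictions.
    println(model.toDebugString)

    // rawPrediction holds the per-class training counts of the leaf each row reaches.
    model.transform(training)
      .select("features", "rawPrediction", "probability", "prediction")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: run locally for the demo
      .appName("LeafCountsSketch")
      .getOrCreate()
    predictions(spark).show(false)
    spark.stop()
  }
}
```

Each distinct rawPrediction vector in the output corresponds to one leaf, and the sum of its entries is the number of training instances in that leaf.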
