Spark ML: How does DecisionTreeClassificationModel know about the tree weights?

Question

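I want to get the weights of the tree nodes from a saved (or unsaved) DecisionTreeClassificationModel, but I can't find anything that resembles them.

How does the model actually perform classification without knowing any of those? Here are the parameters saved with the model: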

{"class":"org.apache.spark.ml.classification.DecisionTreeClassificationModel",
"timestamp":1551207582648,
"sparkVersion":"2.3.2",
"uid":"DecisionTreeClassifier_4ffc94d20f1ddb29f282",
"paramMap":{
"cacheNodeIds":false,
"maxBins":32,
"minInstancesPerNode":1,
"predictionCol":"prediction",
"minInfoGain":0.0,
"rawPredictionCol":"rawPrediction",
"featuresCol":"features",
"probabilityCol":"probability",
"checkpointInterval":10,
"seed":956191873026065186,
"impurity":"gini",
"maxMemoryInMB":256,
"maxDepth":2,
"labelCol":"indexed"
},
"numFeatures":1,
"numClasses":2
}

Answer 1

By using treeWeights:

treeWeights

Return the weights for each tree

New in version 1.5.0.

So

How does the model actually perform the classification not knowing any of those.

The weights are stored, just not as part of the metadata. If you have a model

from pyspark.ml.classification import RandomForestClassificationModel

model: RandomForestClassificationModel = ...

and save it to disk

path: str = ...

model.save(path)

you'll see that the writer creates a treesMetadata subdirectory. If you load its contents (the default writer uses Parquet):

import os

trees_metadata = spark.read.parquet(os.path.join(path, "treesMetadata"))

you'll see the following structure:

trees_metadata.printSchema()

root
 |-- treeID: integer (nullable = true)
 |-- metadata: string (nullable = true)
 |-- weights: double (nullable = true)

where the weights column contains the weight of the tree identified by treeID.
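To make the role of these per-tree weights more concrete, here is a minimal pure-Python sketch of a generic weighted combination of per-tree scores. The function name `weighted_average` and the sample numbers are hypothetical, and this is not Spark's aggregation code: how (and whether) each ensemble model type uses its weights at prediction time is an implementation detail not reproduced here.

```python
from typing import Dict


def weighted_average(tree_scores: Dict[int, float],
                     tree_weights: Dict[int, float]) -> float:
    """Combine per-tree scores into one value using per-tree weights.

    Both dicts are keyed by treeID, mirroring the treeID/weights columns
    of the treesMetadata table. Generic illustration only.
    """
    total_weight = sum(tree_weights[t] for t in tree_scores)
    weighted_sum = sum(tree_weights[t] * tree_scores[t] for t in tree_scores)
    return weighted_sum / total_weight


# Hypothetical values: in practice the weights could be built from
# (treeID, weights) rows collected from the trees_metadata DataFrame.
tree_weights = {0: 1.0, 1: 1.0, 2: 2.0}
tree_scores = {0: 1.0, 1: 0.0, 2: 1.0}

combined = weighted_average(tree_scores, tree_weights)
print(combined)
```

With equal weights this reduces to a plain average, which is why the weights matter only when they differ between trees.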

Similarly, node data is stored in the data subdirectory (see the example schema below):

spark.read.parquet(os.path.join(path, "data")).printSchema()

root
 |-- id: integer (nullable = true)
 |-- prediction: double (nullable = true)
 |-- impurity: double (nullable = true)
 |-- impurityStats: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- gain: double (nullable = true)
 |-- leftChild: integer (nullable = true)
 |-- rightChild: integer (nullable = true)
 |-- split: struct (nullable = true)
 |    |-- featureIndex: integer (nullable = true)
 |    |-- leftCategoriesOrThreshold: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |    |-- numCategories: integer (nullable = true)

Equivalent information (minus tree data and tree weights) is available for DecisionTreeClassificationModel as well.
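The node table above is enough to classify a point without any separate "node weights": each row carries its split, its children, and its prediction. Below is a minimal pure-Python sketch of that traversal. It rests on several assumptions that are not stated in this answer and should be checked against your Spark version: leaf rows have leftChild equal to -1, the root has id 0, numCategories equal to -1 marks a continuous split whose threshold is the single element of leftCategoriesOrThreshold, and a feature value less than or equal to the threshold goes left. The function name `predict_from_rows` and the sample rows are hypothetical.

```python
from typing import Any, Dict, List, Sequence


def predict_from_rows(rows: List[Dict[str, Any]],
                      features: Sequence[float],
                      root_id: int = 0) -> float:
    """Walk a tree stored as flat node rows and return the leaf prediction."""
    nodes = {row["id"]: row for row in rows}
    node = nodes[root_id]
    # Assumption: a leaf is marked by leftChild == -1.
    while node["leftChild"] != -1:
        split = node["split"]
        value = features[split["featureIndex"]]
        if split["numCategories"] == -1:
            # Assumed continuous split: the array holds one threshold.
            go_left = value <= split["leftCategoriesOrThreshold"][0]
        else:
            # Assumed categorical split: the array lists categories going left.
            go_left = value in split["leftCategoriesOrThreshold"]
        node = nodes[node["leftChild"] if go_left else node["rightChild"]]
    return node["prediction"]


# Hypothetical rows with only the columns the traversal needs.
leaf_split = {"featureIndex": -1, "leftCategoriesOrThreshold": [], "numCategories": -1}
rows = [
    {"id": 0, "prediction": 0.0, "leftChild": 1, "rightChild": 2,
     "split": {"featureIndex": 0, "leftCategoriesOrThreshold": [0.5],
               "numCategories": -1}},
    {"id": 1, "prediction": 0.0, "leftChild": -1, "rightChild": -1,
     "split": leaf_split},
    {"id": 2, "prediction": 1.0, "leftChild": -1, "rightChild": -1,
     "split": leaf_split},
]

print(predict_from_rows(rows, [0.2]))
print(predict_from_rows(rows, [0.9]))
```

With real data, the rows could come from collecting the data DataFrame and converting each Row with asDict(recursive=True); for an ensemble the rows would first have to be grouped per tree.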

apache-spark

pyspark

apache-spark-ml