Spark ML: How does DecisionTreeClassificationModel know about the tree weights?
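I would like to get the weights of the tree nodes from a saved (or unsaved) DecisionTreeClassificationModel, but I can't find anything resembling them.
How does the model actually perform the classification without knowing any of those? Below are the parameters that are saved with the model: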
{
  "class": "org.apache.spark.ml.classification.DecisionTreeClassificationModel",
  "timestamp": 1551207582648,
  "sparkVersion": "2.3.2",
  "uid": "DecisionTreeClassifier_4ffc94d20f1ddb29f282",
  "paramMap": {
    "cacheNodeIds": false,
    "maxBins": 32,
    "minInstancesPerNode": 1,
    "predictionCol": "prediction",
    "minInfoGain": 0.0,
    "rawPredictionCol": "rawPrediction",
    "featuresCol": "features",
    "probabilityCol": "probability",
    "checkpointInterval": 10,
    "seed": 956191873026065186,
    "impurity": "gini",
    "maxMemoryInMB": 256,
    "maxDepth": 2,
    "labelCol": "indexed"
  },
  "numFeatures": 1,
  "numClasses": 2
}
By using treeWeights:
treeWeights
Return the weights for each tree
New in version 1.5.0.
So, regarding
"How does the model actually perform the classification not knowing any of those?"
The weights are stored, just not as part of the metadata. If you have a model
from pyspark.ml.classification import RandomForestClassificationModel
model: RandomForestClassificationModel = ...
and save it to disk
path: str = ...
model.save(path)
you'll see that the writer creates a treesMetadata subdirectory. If you load its contents (the default writer uses Parquet):
import os
trees_metadata = spark.read.parquet(os.path.join(path, "treesMetadata"))
you'll see the following structure:
trees_metadata.printSchema()
root
|-- treeID: integer (nullable = true)
|-- metadata: string (nullable = true)
|-- weights: double (nullable = true)
where the weights column contains the weight of the tree identified by treeID.
Similarly, node data is stored in the data subdirectory, for example:
spark.read.parquet(os.path.join(path, "data")).printSchema()
root
|-- id: integer (nullable = true)
|-- prediction: double (nullable = true)
|-- impurity: double (nullable = true)
|-- impurityStats: array (nullable = true)
| |-- element: double (containsNull = true)
|-- gain: double (nullable = true)
|-- leftChild: integer (nullable = true)
|-- rightChild: integer (nullable = true)
|-- split: struct (nullable = true)
| |-- featureIndex: integer (nullable = true)
| |-- leftCategoriesOrThreshold: array (nullable = true)
| | |-- element: double (containsNull = true)
| |-- numCategories: integer (nullable = true)
Equivalent information (minus the tree data and tree weights) is available for DecisionTreeClassificationModel as well.
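To make the "how does it classify" part concrete, here is a minimal pure-Python sketch of walking a single tree using only the fields from the node schema above. It is not Spark's implementation. The node values are made up, and the conventions it relies on are assumptions on my part: the root has id 0, a leaf is marked by leftChild == -1, a continuous split has numCategories == -1 with the threshold as the only element of leftCategoriesOrThreshold, and a row goes left when its feature value is less than or equal to the threshold (or, for a categorical split, when the value is one of the listed left categories).

```python
from typing import Any, Dict, List, Sequence

# Hypothetical rows shaped like the "data" schema, reduced to the fields
# needed for prediction. All values are invented for illustration.
NODES: List[Dict[str, Any]] = [
    {"id": 0, "prediction": 0.0, "leftChild": 1, "rightChild": 2,
     "split": {"featureIndex": 0,
               "leftCategoriesOrThreshold": [0.5],
               "numCategories": -1}},
    {"id": 1, "prediction": 0.0, "leftChild": -1, "rightChild": -1,
     "split": {"featureIndex": -1,
               "leftCategoriesOrThreshold": [],
               "numCategories": -1}},
    {"id": 2, "prediction": 1.0, "leftChild": -1, "rightChild": -1,
     "split": {"featureIndex": -1,
               "leftCategoriesOrThreshold": [],
               "numCategories": -1}},
]


def predict(nodes: List[Dict[str, Any]], features: Sequence[float]) -> float:
    """Walk from the root to a leaf and return that leaf's prediction."""
    by_id = {node["id"]: node for node in nodes}
    node = by_id[0]  # assumption: the root node has id 0
    while node["leftChild"] != -1:  # assumption: -1 marks a leaf
        split = node["split"]
        value = features[split["featureIndex"]]
        if split["numCategories"] == -1:
            # Assumed continuous split: a single threshold, go left on <=.
            go_left = value <= split["leftCategoriesOrThreshold"][0]
        else:
            # Assumed categorical split: go left if the category is listed.
            go_left = value in split["leftCategoriesOrThreshold"]
        node = by_id[node["leftChild"] if go_left else node["rightChild"]]
    return node["prediction"]


print(predict(NODES, [0.2]))
print(predict(NODES, [0.9]))
```

In other words, a single tree needs no weights at all to classify: the split and child columns are enough. With a real model you could build the same list of dictionaries from the collected rows of the data DataFrame (for example with Row.asDict(recursive=True)) and compare the result against model.transform on a few rows to check that the assumed conventions hold for your Spark version.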