SPARK, ML, Tuning, CrossValidator: access the metrics
To build a NaiveBayes multiclass classifier, I use a CrossValidator to select the best parameters for my pipeline:
val cv = new CrossValidator()
.setEstimator(pipeline)
.setEstimatorParamMaps(paramGrid)
.setEvaluator(new MulticlassClassificationEvaluator)
.setNumFolds(10)
val cvModel = cv.fit(trainingSet)
The pipeline contains the usual transformers and estimators, in this order: Tokenizer, StopWordsRemover, HashingTF, IDF and finally NaiveBayes.
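A minimal sketch of such a pipeline (the input/output column names here are placeholders, not my actual code):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, IDF, StopWordsRemover, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val remover   = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
val tf        = new HashingTF().setInputCol("filtered").setOutputCol("rawFeatures")
val idf       = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val nb        = new NaiveBayes() // reads the default "features"/"label" columns

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, remover, tf, idf, nb))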
Is it possible to access the metrics computed for the best model?
Ideally, I would like to access the metrics of all the models, to see how changing the parameters changes the quality of the classification. But for the moment, the best model is enough.
FYI, I'm using Spark 1.6.0.
Here is how I do it:
val pipeline = new Pipeline()
.setStages(Array(tokenizer, stopWordsFilter, tf, idf, word2Vec, featureVectorAssembler, categoryIndexerModel, classifier, categoryReverseIndexer))
...
val paramGrid = new ParamGridBuilder()
.addGrid(tf.numFeatures, Array(10, 100))
.addGrid(idf.minDocFreq, Array(1, 10))
.addGrid(word2Vec.vectorSize, Array(200, 300))
.addGrid(classifier.maxDepth, Array(3, 5))
.build()
paramGrid.size // 16 entries
...
// Average cross-validation metric for each ParamGrid entry
val avgMetricsParamGrid = crossValidatorModel.avgMetrics
// Combine with paramGrid to see how they affect the overall metrics
val combined = paramGrid.zip(avgMetricsParamGrid)
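For example, to print the parameter combinations ranked by their metric (a sketch, assuming the evaluator's metric is larger-is-better, as with the default F1):

combined
  .sortBy { case (_, metric) => -metric }
  .foreach { case (params, metric) => println(s"$metric: $params") }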
...
val bestModel = crossValidatorModel.bestModel.asInstanceOf[PipelineModel]
// Explain params for each stage
val bestHashingTFNumFeatures = bestModel.stages(2).asInstanceOf[HashingTF].explainParams
val bestIDFMinDocFrequency = bestModel.stages(3).asInstanceOf[IDFModel].explainParams
val bestWord2VecVectorSize = bestModel.stages(4).asInstanceOf[Word2VecModel].explainParams
val bestDecisionTreeDepth = bestModel.stages(7).asInstanceOf[DecisionTreeClassificationModel].explainParams
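If you only need the winning values rather than the full explainParams dump, the per-stage getters work too (a sketch, assuming the same stage indices and types as above):

val bestNumFeatures = bestModel.stages(2).asInstanceOf[HashingTF].getNumFeatures
val bestMinDocFreq  = bestModel.stages(3).asInstanceOf[IDFModel].getMinDocFreq
val bestVectorSize  = bestModel.stages(4).asInstanceOf[Word2VecModel].getVectorSize
val bestMaxDepth    = bestModel.stages(7).asInstanceOf[DecisionTreeClassificationModel].getMaxDepth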
cvModel.avgMetrics
Works on pyspark 2.2.0.
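And a sketch (in Scala; the pyspark equivalent is analogous) of pairing those metrics with the parameter maps to pick out the best combination directly, assuming a larger-is-better metric:

val (bestParams, bestMetric) =
  cvModel.getEstimatorParamMaps.zip(cvModel.avgMetrics).maxBy(_._2)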