Spark CrossValidatorModel access other models than the bestModel?
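I am using Spark 1.6.1:
Currently I am using a CrossValidator to train my ML Pipeline with various parameters. After the training process I can use the bestModel property of the CrossValidatorModel to get the model that performed best during the cross validation.
Are the other models of the cross validation automatically discarded, or can I also select a model that performed worse than the bestModel?
I am asking because I am using the F1 score metric for the cross validation, but I am also interested in the weightedRecall of all of the models, not just of the model that performed best during the cross validation.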
val folds = 6
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new MulticlassClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(folds)

// Fit on the training data (trainDf is an assumed name; the fit call is not shown in the question)
val cvModel = cv.fit(trainDf)

val avgF1Scores = cvModel.avgMetrics

val predictedDf = cvModel.bestModel.transform(testDf)

// Here I would like to predict as well with the other models of the cross validation
Spark >= 2.4.0 ( >= 2.3.0 in Scala)
SPARK-21088 (CrossValidator, TrainValidationSplit should collect all models when fitting) adds support for collecting submodels.
cv = CrossValidator(..., collectSubModels=True)
model = cv.fit(...)
model.subModels
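To get a second metric such as weightedRecall for every collected sub-model, one option is to loop over subModels and evaluate each one. The following is a minimal sketch, not part of the Spark API: score_sub_models is a hypothetical helper name, and it assumes subModels is indexed as subModels[fold][paramIndex] and that each param map is a dict keyed by Param objects with a name attribute.

```python
def score_sub_models(sub_models, param_maps, evaluator, dataset):
    """Evaluate every collected sub-model on `dataset`.

    sub_models is assumed to be indexed as sub_models[fold][param_index].
    Returns a list of (fold, param_index, params, score) tuples.
    """
    results = []
    for fold, fold_models in enumerate(sub_models):
        for j, sub_model in enumerate(fold_models):
            # Score this sub-model with whatever metric the evaluator computes
            score = evaluator.evaluate(sub_model.transform(dataset))
            # Make the param map readable: Param object -> its name
            params = {p.name: v for p, v in param_maps[j].items()}
            results.append((fold, j, params, score))
    return results
```

With a fitted model this could be called as score_sub_models(model.subModels, cv.getEstimatorParamMaps(), MulticlassClassificationEvaluator(metricName="weightedRecall"), testDf), where testDf stands for whatever held-out DataFrame is available.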
Spark < 2.4
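If you want to access all intermediate models, you'll have to create a custom cross validator from scratch. o.a.s.ml.tuning.CrossValidator discards the other models; only the best one and the metrics are copied to the CrossValidatorModel.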
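As a rough idea of what such a custom cross validator has to do, the sketch below fits every param map on every fold and keeps each fitted model alongside its metric. It is an illustration under assumptions, not Spark's implementation: cross_validate_keep_models is a hypothetical name, the (train, validation) pairs are prepared by the caller, and the estimator and evaluator only need fit(dataset, paramMap), transform(dataset) and evaluate(dataset) methods like their pyspark.ml counterparts.

```python
def cross_validate_keep_models(estimator, evaluator, param_maps, folds):
    """Fit every param map on every fold and keep each fitted model.

    estimator  -- object with fit(dataset, param_map)
    evaluator  -- object with evaluate(dataset)
    param_maps -- list of param maps
    folds      -- list of (train, validation) pairs prepared by the caller
    Returns (models, metrics), both indexed as [fold][param_index].
    """
    models, metrics = [], []
    for train, validation in folds:
        # One fitted model per param map, all retained instead of discarded
        fold_models = [estimator.fit(train, pm) for pm in param_maps]
        # Score each model on the held-out part of this fold
        fold_metrics = [evaluator.evaluate(m.transform(validation))
                        for m in fold_models]
        models.append(fold_models)
        metrics.append(fold_metrics)
    return models, metrics
```

Keeping every model holds numFolds times len(param_maps) fitted models in memory at once, which is presumably why the built-in class does not do it by default.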
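If you are just looking to experiment rather than build a production implementation, I recommend monkey-patching. Here is what I did to print out the intermediate training results. Just use CrossValidatorVerbose as a drop-in replacement for CrossValidator.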
import numpy as np

from pyspark.ml.tuning import CrossValidator, CrossValidatorModel
from pyspark.sql.functions import rand


class CrossValidatorVerbose(CrossValidator):

    def _fit(self, dataset):
        est = self.getOrDefault(self.estimator)
        epm = self.getOrDefault(self.estimatorParamMaps)
        numModels = len(epm)

        eva = self.getOrDefault(self.evaluator)
        metricName = eva.getMetricName()

        nFolds = self.getOrDefault(self.numFolds)
        seed = self.getOrDefault(self.seed)
        h = 1.0 / nFolds

        randCol = self.uid + "_rand"
        df = dataset.select("*", rand(seed).alias(randCol))
        metrics = [0.0] * numModels

        for i in range(nFolds):
            foldNum = i + 1
            print("Comparing models on fold %d" % foldNum)

            validateLB = i * h
            validateUB = (i + 1) * h
            condition = (df[randCol] >= validateLB) & (df[randCol] < validateUB)
            validation = df.filter(condition)
            train = df.filter(~condition)

            for j in range(numModels):
                paramMap = epm[j]
                model = est.fit(train, paramMap)
                # TODO: duplicate evaluator to take extra params from input
                metric = eva.evaluate(model.transform(validation, paramMap))
                metrics[j] += metric

                avgSoFar = metrics[j] / foldNum
                print("params: %s\t%s: %f\tavg: %f" % (
                    {param.name: val for (param, val) in paramMap.items()},
                    metricName, metric, avgSoFar))

        if eva.isLargerBetter():
            bestIndex = np.argmax(metrics)
        else:
            bestIndex = np.argmin(metrics)

        bestParams = epm[bestIndex]
        bestModel = est.fit(dataset, bestParams)
        avgMetrics = [m / nFolds for m in metrics]
        bestAvg = avgMetrics[bestIndex]
        print("Best model:\nparams: %s\t%s: %f" % (
            {param.name: val for (param, val) in bestParams.items()},
            metricName, bestAvg))

        return self._copyValues(CrossValidatorModel(bestModel, avgMetrics))
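NOTE: This solution also corrects what I see as a bug in v2.0.0, where CrossValidatorModel.avgMetrics is set to the sum of the metrics instead of the average.
Here is an example of the output for a simple 5-fold validation of ALS: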
Comparing models on fold 1
params: {'regParam': 0.1, 'rank': 5, 'maxIter': 10} rmse: 1.122425 avg: 1.122425
params: {'regParam': 0.01, 'rank': 5, 'maxIter': 10} rmse: 1.123537 avg: 1.123537
params: {'regParam': 0.001, 'rank': 5, 'maxIter': 10} rmse: 1.123651 avg: 1.123651
Comparing models on fold 2
params: {'regParam': 0.1, 'rank': 5, 'maxIter': 10} rmse: 0.992541 avg: 1.057483
params: {'regParam': 0.01, 'rank': 5, 'maxIter': 10} rmse: 0.992541 avg: 1.058039
params: {'regParam': 0.001, 'rank': 5, 'maxIter': 10} rmse: 0.992541 avg: 1.058096
Comparing models on fold 3
params: {'regParam': 0.1, 'rank': 5, 'maxIter': 10} rmse: 1.141786 avg: 1.085584
params: {'regParam': 0.01, 'rank': 5, 'maxIter': 10} rmse: 1.141786 avg: 1.085955
params: {'regParam': 0.001, 'rank': 5, 'maxIter': 10} rmse: 1.141786 avg: 1.085993
Comparing models on fold 4
params: {'regParam': 0.1, 'rank': 5, 'maxIter': 10} rmse: 0.954110 avg: 1.052715
params: {'regParam': 0.01, 'rank': 5, 'maxIter': 10} rmse: 0.952955 avg: 1.052705
params: {'regParam': 0.001, 'rank': 5, 'maxIter': 10} rmse: 0.952873 avg: 1.052713
Comparing models on fold 5
params: {'regParam': 0.1, 'rank': 5, 'maxIter': 10} rmse: 1.140098 avg: 1.070192
params: {'regParam': 0.01, 'rank': 5, 'maxIter': 10} rmse: 1.139589 avg: 1.070082
params: {'regParam': 0.001, 'rank': 5, 'maxIter': 10} rmse: 1.139535 avg: 1.070077
Best model:
params: {'regParam': 0.001, 'rank': 5, 'maxIter': 10} rmse: 1.070077
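The sum-versus-average point can be checked by hand against the output above. The snippet below is only an arithmetic check on the five per-fold rmse values printed for regParam 0.1; the variable names are illustrative.

```python
# Per-fold rmse values printed above for {'regParam': 0.1, 'rank': 5, 'maxIter': 10}
fold_rmse = [1.122425, 0.992541, 1.141786, 0.954110, 1.140098]

total = sum(fold_rmse)             # what avgMetrics would hold if it stored the sum
average = total / len(fold_rmse)   # what the class above stores: total / nFolds

print("sum: %f" % total)
print("avg: %f" % average)
```

The average agrees with the avg column on the last fold for that parameter set (1.070192), whereas the raw sum is about five times larger and is not a plausible RMSE for this run.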