PySpark ML: how to get subModels values with CrossValidator()

Question

I would like to get the (inner) training accuracy of the cross-validation, using PySpark and the ML library:

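from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# train and validation are assumed to be existing DataFrames
# with the default 'features' and 'label' columns
lr = LogisticRegression()
# 2 x 2 x 2 = 8 parameter combinations
param_grid = (ParamGridBuilder()
                     .addGrid(lr.regParam, [0.01, 0.5])
                     .addGrid(lr.maxIter, [5, 10])
                     .addGrid(lr.elasticNetParam, [0.01, 0.1])
                     .build())
# Note: the default metricName is 'f1'; pass metricName='accuracy' to measure accuracy
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction')
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=param_grid,
                    evaluator=evaluator,
                    numFolds=5)
model_cv = cv.fit(train)
# Score the best model on the held-out validation set
predictions_lr = model_cv.transform(validation)
predictions = evaluator.evaluate(predictions_lr)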

To get the accuracy metric for each c.v. fold, I tried:

print(model_cv.subModels)

But the result of this is empty (None).

How can I get the accuracy of each fold?

Answer 1

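I know this is old, but in case anyone is looking: to keep the non-best models produced during cross-validation, sub-model collection has to be enabled when the CrossValidator is created. Just set collectSubModels to True (it is False by default).

i.e.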

CrossValidator(estimator=lr, 
               estimatorParamMaps=param_grid, 
               evaluator=evaluator, 
               numFolds=5,
               collectSubModels=True)
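
Once sub-models are collected, model_cv.subModels is a nested list: one entry per fold, each holding one fitted model per parameter map. The sketch below shows one possible way to turn that into a table of metrics. The helper name evaluate_sub_models and the evaluator name acc_evaluator are made up for illustration and are not part of PySpark. One caveat: as far as I know, the fitted model does not expose the per-fold validation splits, so this scores each sub-model on whatever DataFrame you pass in (for example the validation set from the question), which is not the same as the inner held-out fold score.

```python
def evaluate_sub_models(sub_models, evaluator, dataset):
    """Score every collected sub-model on `dataset`.

    `sub_models` is expected to have the shape of CrossValidatorModel.subModels:
    an outer list with one entry per fold, each an inner list with one fitted
    model per parameter map. Returns a nested list of metrics of that shape.
    """
    if sub_models is None:
        # This is what the question ran into: nothing was collected.
        raise ValueError(
            "subModels is None; create the CrossValidator with collectSubModels=True"
        )
    return [
        [evaluator.evaluate(model.transform(dataset)) for model in fold_models]
        for fold_models in sub_models
    ]


# Usage sketch with the names from the question (model_cv, validation);
# acc_evaluator is a hypothetical name introduced here:
#   acc_evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
#   per_fold = evaluate_sub_models(model_cv.subModels, acc_evaluator, validation)
#   per_fold[i][j]  -> accuracy of the model from fold i, parameter map j
```

If only the cross-validated average per parameter map is needed, model_cv.avgMetrics already holds it and does not require collectSubModels.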

apache-spark

pyspark

k-fold