How to get best params after tuning by pyspark.ml.tuning.TrainValidationSplit?

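I'm trying to tune the hyperparameters of a Spark (PySpark) ALS model using TrainValidationSplit.

It works well, but I want to know which combination of hyperparameters is best. How can I get the best parameters after evaluation?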

from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

df = sqlCtx.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), (2, 2, 5.0)],
    ["user", "item", "rating"],
)

df_test = sqlCtx.createDataFrame(
    [(0, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 2)],
    ["user", "item"],
)

als = ALS()

param_grid = ParamGridBuilder().addGrid(
    als.rank,
    [10, 15],
).addGrid(
    als.maxIter,
    [10, 15],
).build()

evaluator = RegressionEvaluator(
    metricName="rmse",
    labelCol="rating",
)
tvs = TrainValidationSplit(
    estimator=als,
    estimatorParamMaps=param_grid,
    evaluator=evaluator,
)


model = tvs.fit(df)

Question: how do I get the best rank and maxIter?

You can access the best model using the bestModel property of the TrainValidationSplitModel:

best_model = model.bestModel

The rank can be read directly from the rank property of the ALSModel:

best_model.rank
10

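Getting the maximum number of iterations requires a bit more trickery, because the fitted ALSModel does not expose maxIter directly in the Python API. You have to go through the underlying Java object to its parent estimator: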

(best_model
    ._java_obj     # Get Java object
    .parent()      # Get parent (ALS estimator)
    .getMaxIter()) # Get maxIter
10
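
As an alternative that avoids the private _java_obj attribute, you can pair each parameter map in the grid with its validation metric and pick the best one yourself. The helper below is only a sketch: best_param_map is a hypothetical name, not part of PySpark, and it works on plain lists so it can be tried without a Spark session. It assumes the metrics are listed in the same order as the parameter maps.

```python
def best_param_map(param_maps, metrics, larger_is_better=False):
    """Return (params, metric) for the best-scoring entry of a parameter grid.

    param_maps: list of dicts, one per grid point. Keys may be pyspark Param
                objects (their .name is used) or plain strings.
    metrics:    list of validation metrics, in the same order as param_maps.
    larger_is_better: False for error metrics such as rmse, True for e.g. r2.
    """
    if not metrics or len(param_maps) != len(metrics):
        raise ValueError("param_maps and metrics must be non-empty and equal in length")

    # Pick the index of the smallest (or largest) metric.
    pick = max if larger_is_better else min
    best_idx = pick(range(len(metrics)), key=lambda i: metrics[i])

    # Convert Param keys to their names so the result is easy to read.
    best = {getattr(p, "name", p): v for p, v in param_maps[best_idx].items()}
    return best, metrics[best_idx]


# Hypothetical grid and made-up metrics, for illustration only.
grid = [
    {"rank": 10, "maxIter": 10},
    {"rank": 10, "maxIter": 15},
    {"rank": 15, "maxIter": 10},
]
params, score = best_param_map(grid, [1.2, 0.9, 1.1])
print(params, score)
```

With the fitted model from the question, and assuming your Spark version exposes these members on TrainValidationSplitModel, the call would look like best_param_map(model.getEstimatorParamMaps(), model.validationMetrics, evaluator.isLargerBetter()).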