How to get the best params after tuning with pyspark.ml.tuning.TrainValidationSplit?
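I'm trying to tune the hyperparameters of a Spark (PySpark) ALS model with TrainValidationSplit.
It works well, but I want to know which combination of hyperparameters is the best. How can I get the best params after evaluation?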
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

df = sqlCtx.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), (2, 2, 5.0)],
    ["user", "item", "rating"],
)
df_test = sqlCtx.createDataFrame(
    [(0, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 2)],
    ["user", "item"],
)

als = ALS()
param_grid = ParamGridBuilder().addGrid(
    als.rank,
    [10, 15],
).addGrid(
    als.maxIter,
    [10, 15],
).build()
evaluator = RegressionEvaluator(
    metricName="rmse",
    labelCol="rating",
)
tvs = TrainValidationSplit(
    estimator=als,
    estimatorParamMaps=param_grid,
    evaluator=evaluator,
)
model = tvs.fit(df)
Question: How do I get the best rank and maxIter?
You can access the best model using the bestModel property of the TrainValidationSplitModel:
best_model = model.bestModel
The rank can be read directly from the rank property of the ALSModel:
best_model.rank
10
Getting the maximum number of iterations takes a bit more trickery:
(best_model
    ._java_obj     # Get Java object
    .parent()      # Get parent (ALS estimator)
    .getMaxIter()) # Get maxIter
10
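As an alternative that avoids reaching into the Java object, the param maps can be paired with the validation metrics and the best pair picked directly. The sketch below is only an illustration: best_param_map is a hypothetical helper name, not part of PySpark, and the commented usage assumes a PySpark version in which TrainValidationSplitModel exposes validationMetrics. NaN metrics are skipped, since ALS can produce them when the validation split contains users or items that were not seen in training.

```python
import math


def best_param_map(param_maps, metrics, larger_is_better=False):
    """Return the (param_map, metric) pair with the best validation metric.

    Hypothetical helper, not a PySpark API. Entries whose metric is NaN
    are ignored so they cannot win or distort the comparison.
    """
    scored = [(pm, m) for pm, m in zip(param_maps, metrics) if not math.isnan(m)]
    if not scored:
        raise ValueError("every validation metric is NaN")
    pick = max if larger_is_better else min
    return pick(scored, key=lambda pair: pair[1])


# Usage with the fitted model from the question (assumes validationMetrics
# is available on the TrainValidationSplitModel in your PySpark version):
#
# params, rmse = best_param_map(
#     model.getEstimatorParamMaps(),
#     model.validationMetrics,
#     evaluator.isLargerBetter(),  # False for rmse, so the minimum is picked
# )
# print({param.name: value for param, value in params.items()}, rmse)
```

The metrics are listed in the same order as the param maps, so zipping the two is enough to recover every hyperparameter in the grid, not just rank and maxIter.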