How to extract model hyper-parameters from spark.ml in PySpark?
I'm tinkering with some cross-validation code from the PySpark documentation, and trying to get PySpark to tell me what model was selected:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
dataset = sqlContext.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
(Vectors.dense([0.4]), 1.0),
(Vectors.dense([0.5]), 0.0),
(Vectors.dense([0.6]), 1.0),
(Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
Running this in the PySpark shell, I can get the logistic regression model's coefficients, but I can't seem to find the value of lr.regParam selected by the cross-validation procedure. Any ideas?
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])
In [4]: cvModel.bestModel.explainParams()
Out[4]: ''
In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}
In [15]: cvModel.params
Out[15]: []
In [36]: cvModel.bestModel.params
Out[36]: []
I ran into this wall too. Unfortunately, you can only get certain parameters for certain models. Happily, for logistic regression you can access the intercept and weights; sadly you cannot retrieve regParam.
This can be done in the following way:
best_lr = cvModel.bestModel
# get weights
best_lr.weights
>>> DenseVector([3.1573])
# or better
best_lr.coefficients
>>> DenseVector([3.1573])
# get intercept
best_lr.intercept
>>> -1.0829958115287153
As I wrote before, each model exposes only a few parameters that can be extracted.
Overall, getting the relevant model out of a pipeline (e.g. cvModel.bestModel when the cross-validator has run over a pipeline) can be done like this:
best_pipeline = cvModel.bestModel
best_pipeline.stages
>>> [Tokenizer_4bc8884ad68b4297fd3c, CountVectorizer_411fbdeb4100c2bfe8ef, PCA_4c538d67e7b8f29ff8d0, LogisticRegression_4db49954edc7033edc76]
Each model is then obtained by simple list indexing:
best_lr = best_pipeline.stages[3]
and now everything from above can be applied.
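As a side note, on recent PySpark versions (roughly 2.4 and later) fitted models inherit the estimator's param getters, so the chosen value can often be read directly; a minimal sketch, assuming a new enough Spark:

# Minimal sketch, assuming PySpark >= 2.4/3.x where fitted models expose
# the estimator's param getters (older versions raise AttributeError here):
best_lr = cvModel.bestModel
print(best_lr.getRegParam())             # the value the cross-validator chose
print(best_lr.getOrDefault('regParam'))  # generic lookup by param name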
Ran into this problem as well. I found out that, for some reason I can't explain, you need to call the java property. So just do this:
from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder, CrossValidator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(metricName="mae")
lr = LinearRegression()
grid = ParamGridBuilder().addGrid(lr.maxIter, [500]) \
.addGrid(lr.regParam, [0]) \
.addGrid(lr.elasticNetParam, [1]) \
.build()
lr_cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, \
evaluator=evaluator, numFolds=3)
lrModel = lr_cv.fit(your_training_set_here)
bestModel = lrModel.bestModel
And print out the parameters you want:
>>> print('Best Param (regParam): ', bestModel._java_obj.getRegParam())
0
>>> print('Best Param (MaxIter): ', bestModel._java_obj.getMaxIter())
500
>>> print('Best Param (elasticNetParam): ', bestModel._java_obj.getElasticNetParam())
1
This also applies to other methods like extractParamMap(). They should fix this soon.
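For example, the _java_obj workaround returns a populated param map even when the Python-side extractParamMap() comes back empty; a small sketch (note the result is a JVM ParamMap, so it prints in Java's toString format):

# The Python-side extractParamMap() may return {}, but the JVM object's
# map is populated; it prints in Java's toString format:
print(bestModel._java_obj.extractParamMap())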
Assuming cvModel3Day is your model name, in Spark Scala the params can be extracted as shown below:
val bestGBT      = cvModel3Day.bestModel.asInstanceOf[PipelineModel].stages(2).asInstanceOf[GBTClassificationModel]
val params       = bestGBT.extractParamMap()
val depth        = bestGBT.getMaxDepth
val iter         = bestGBT.getMaxIter
val bins         = bestGBT.getMaxBins
val features     = bestGBT.getFeaturesCol
val step         = bestGBT.getStepSize
val samplingRate = bestGBT.getSubsamplingRate
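For PySpark users, roughly the equivalent (a sketch; it assumes stage index 2 holds the fitted GBTClassificationModel and a recent Spark where model param getters exist):

# Rough PySpark equivalent (a sketch; assumes the pipeline's stage 2 is the
# fitted GBTClassificationModel and a recent Spark version):
best_gbt = cvModel3Day.bestModel.stages[2]
print(best_gbt.getMaxDepth(), best_gbt.getMaxIter(), best_gbt.getMaxBins())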
There are actually two questions here:
- what are the aspects of the fitted model, such as its coefficients and intercept
- what were the meta-parameters with which the bestModel was fitted
Unfortunately, the python API of the fitted estimators (the models) doesn't allow (easy) direct access to the estimator's parameters, which makes answering the latter question hard.
However, there is a workaround using the java API. For completeness, first a full setup of a cross-validated model:
%pyspark
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
logit = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[logit])
paramGrid = ParamGridBuilder() \
.addGrid(logit.regParam, [0, 0.01, 0.05, 0.1, 0.5, 1]) \
.addGrid(logit.elasticNetParam, [0.0, 0.1, 0.5, 0.8, 1]) \
.build()
evaluator = BinaryClassificationEvaluator(metricName = 'areaUnderPR')
crossval = CrossValidator(estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=evaluator,
numFolds=5)
tuned_model = crossval.fit(train)
model = tuned_model.bestModel
Then one can use generic methods on the java object to get the parameter values, without explicitly referring to methods like getRegParam():
java_model = model.stages[-1]._java_obj
{param.name: java_model.getOrDefault(java_model.getParam(param.name))
for param in paramGrid[0]}
This executes the following steps:
- Get the fitted logit model as created by the estimator from the last stage of the best model: crossval.fit(..).bestModel.stages[-1]
- Get the internal java object from _java_obj
- Get all configured names from paramGrid (which is a list of dictionaries). Only the first row is used, assuming it is an actual grid, i.e. every row contains the same keys. Otherwise you would need to collect all names ever used in any row (see the sketch after this list).
- Get the corresponding Param<T> identifier from the java object.
- Pass the Param<T> instance to getOrDefault() to get the actual value.
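If the param maps are not a rectangular grid, a small variant collects every name appearing in any row first; a sketch reusing the java_model from above:

# Sketch for non-rectangular param maps: gather every param name that
# appears in any row of paramGrid, then resolve each one on the java object.
all_names = {param.name for row in paramGrid for param in row}
{name: java_model.getOrDefault(java_model.getParam(name)) for name in all_names}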
This took a few minutes to decipher, but I figured it out.
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# note: I've already built out my pipeline model; here I set up the grid and the validator
paramGrid = ParamGridBuilder() \
.addGrid(hashingTF.numFeatures, [1000]) \
.addGrid(linearSVC.regParam, [0.1, 0.01]) \
.addGrid(linearSVC.maxIter, [10, 20, 30]) \
.build()
crossval = CrossValidator(estimator=pipeline,\
estimatorParamMaps=paramGrid,\
evaluator=MulticlassClassificationEvaluator(),\
numFolds=2)
cvModel = crossval.fit(train)
prediction = cvModel.transform(test)
bestModel = cvModel.bestModel
# applicable to your model to pull list of all stages
for x in range(len(bestModel.stages)):
    print(bestModel.stages[x])
# get stage feature by calling the correct Transformer, then .get<parameter>()
print(bestModel.stages[3].getNumFeatures())
This may not be as good as wernerchao's answer (because storing the hyperparameters in variables is not convenient this way), but you can quickly look at the best hyper-parameters of a cross-validation model like this:
import numpy as np
cvModel.getEstimatorParamMaps()[np.argmax(cvModel.avgMetrics)]
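One caveat: np.argmax assumes the metric is larger-is-better; a small sketch that checks the evaluator's direction first (here evaluator stands for the instance that was passed to the CrossValidator):

# Sketch: respect the evaluator's direction instead of assuming larger-is-
# better (e.g. MAE should be minimized); `evaluator` is the instance that
# was passed to the CrossValidator.
import numpy as np
idx = int(np.argmax(cvModel.avgMetrics)) if evaluator.isLargerBetter() \
    else int(np.argmin(cvModel.avgMetrics))
for param, value in cvModel.getEstimatorParamMaps()[idx].items():
    print(param.name, '=', value)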
(2020-05-21)
I know this is an old question, but I found a way to do this.
@Pierre Gourseaud gives us a nice way to fetch the hyper-parameters of the best model:
hyperparams = model_cv.getEstimatorParamMaps()[np.argmax(model_cv.avgMetrics)]
print(hyperparams)
[(Param(parent='ALS_cd65d45ab31c', name='implicitPrefs', doc='whether to use implicit preference'),
True),
(Param(parent='ALS_cd65d45ab31c', name='nonnegative', doc='whether to use nonnegative constraint for least squares'),
True),
(Param(parent='ALS_cd65d45ab31c', name='coldStartStrategy', doc="strategy for dealing with unknown or new users/items at prediction time. This may be useful in cross-validation or production scenarios, for handling user/item ids the model has not seen in the training data. Supported values: 'nan', 'drop'."),
'drop'),
(Param(parent='ALS_cd65d45ab31c', name='rank', doc='rank of the factorization'),
28),
(Param(parent='ALS_cd65d45ab31c', name='maxIter', doc='max number of iterations (>= 0).'),
20),
(Param(parent='ALS_cd65d45ab31c', name='regParam', doc='regularization parameter (>= 0).'),
0.01),
(Param(parent='ALS_cd65d45ab31c', name='alpha', doc='alpha for implicit preference'),
20.0)]
But this doesn't look pretty, so you can do this instead:
import re
hyper_list = []
for i in range(len(hyperparams.items())):
    hyper_name = re.search("name='(.+?)'", str([x for x in hyperparams.items()][i])).group(1)
    hyper_value = [x for x in hyperparams.items()][i][1]
    hyper_list.append({hyper_name: hyper_value})
print(hyper_list)
[{'implicitPrefs': True}, {'nonnegative': True}, {'coldStartStrategy': 'drop'}, {'rank': 28}, {'maxIter': 20}, {'regParam': 0.01}, {'alpha': 20.0}]
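Since each Param object already exposes its name, the regex isn't strictly necessary; a simpler sketch that builds the same kind of summary from the hyperparams map above:

# Simpler sketch: Param objects carry a .name attribute, so no regex needed.
hyper_dict = {param.name: value for param, value in hyperparams.items()}
print(hyper_dict)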
In my case I trained an ALS model, but it should work in your case too, because I also trained with cross-validation!