Cross validation in PySpark
I trained a linear regression model with cross validation, using the following code:
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
lr = LinearRegression(maxIter=maxIteration)
modelEvaluator = RegressionEvaluator()
pipeline = Pipeline(stages=[lr])
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).addGrid(lr.elasticNetParam, [0, 1]).build()
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=modelEvaluator,
                          numFolds=3)
cvModel = crossval.fit(training)
Now I want to plot the ROC curve. I used the code below, but I get this error:
'LinearRegressionTrainingSummary' object has no attribute 'areaUnderROC'
trainingSummary = cvModel.bestModel.stages[-1].summary
trainingSummary.roc.show()
print("areaUnderROC: " + str(trainingSummary.areaUnderROC))
I also want to check the objectiveHistory at each iteration. I know I can get it at the end:
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
But I want to get it at every iteration. How can I do that?
I also want to evaluate the model on the test data. How can I do that?
prediction = cvModel.transform(test)
I know that for the training dataset I can write:
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)
But how can I get these metrics for the test dataset?
1) The area under the ROC curve (AUC) is defined only for binary classification, so you cannot use it for a regression task, which is what you are attempting here.
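To make point 1) concrete, here is a minimal pure-Python sketch (not Spark code; the helper name `auc` is made up for illustration). It computes AUC as the fraction of positive/negative pairs in which the positive example receives the higher score, which is why the labels have to be binary:

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    # The definition needs exactly two classes, encoded here as 0 and 1.
    if set(labels) != {0, 1}:
        raise ValueError("AUC needs binary labels containing both 0 and 1")
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count a win when the positive scores higher, half a win on ties.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Binary labels: the computation is well defined.
binary_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(binary_auc)

# Continuous regression labels: there are no "positives" or "negatives" to rank.
try:
    auc([0.2, 1.4, 1.9, 0.9], [0.3, 1.0, 1.2, 0.8])
except ValueError as err:
    print("not applicable:", err)
```

With continuous labels there is no positive/negative split, so the quantity simply does not exist for a regression model.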
2) The per-iteration objectiveHistory is available only when the solver parameter of the regression is l-bfgs (see the documentation); here is a toy example:
spark.version
# u'2.1.1'
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vectors
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
dataset = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.2),
     (Vectors.dense([0.4]), 1.4),
     (Vectors.dense([0.5]), 1.9),
     (Vectors.dense([0.6]), 0.9),
     (Vectors.dense([1.2]), 1.0)] * 10,
    ["features", "label"])
lr = LinearRegression(maxIter=5, solver="l-bfgs") # solver="l-bfgs" here
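modelEvaluator = RegressionEvaluator()
pipeline = Pipeline(stages=[lr])  # kept from the question but unused below: lr is passed to CrossValidator directly
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).addGrid(lr.elasticNetParam, [0, 1]).build()
crossval = CrossValidator(estimator=lr,  # with estimator=lr, bestModel is the LinearRegressionModel itself
                          estimatorParamMaps=paramGrid,
                          evaluator=modelEvaluator,
                          numFolds=3)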
cvModel = crossval.fit(dataset)
trainingSummary = cvModel.bestModel.summary
trainingSummary.totalIterations
# 2
trainingSummary.objectiveHistory # one value for each iteration
# [0.49, 0.4511834723904831]
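Since objectiveHistory comes back as a plain Python list, you can loop over it once training has finished. Here is a small sketch that uses the two values printed above in place of a live model (with a fitted model you would presumably pass trainingSummary.objectiveHistory instead; the helper name `describe_history` is made up for this example):

```python
def describe_history(history):
    """Return one (index, objective, change_from_previous) tuple per entry."""
    rows = []
    previous = None
    for i, objective in enumerate(history):
        # The first entry has no predecessor, so its change is None.
        change = None if previous is None else objective - previous
        rows.append((i, objective, change))
        previous = objective
    return rows


# Values copied from the toy example output above.
objective_history = [0.49, 0.4511834723904831]

for i, objective, change in describe_history(objective_history):
    print("entry %d: objective=%s change=%s" % (i, objective, change))
```

A negative change means the objective went down between two consecutive entries, which is what you would expect while the optimizer is still making progress.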
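3) You have already defined a RegressionEvaluator that you can use to evaluate your test set; used without arguments, it defaults to the RMSE metric. Here is a way to define evaluators with different metrics and apply them to your test set (continuing the code above):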
test = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.2),
     (Vectors.dense([0.4]), 1.1),
     (Vectors.dense([0.5]), 0.9),
     (Vectors.dense([0.6]), 1.0)],
    ["features", "label"])
modelEvaluator.evaluate(cvModel.transform(test)) # rmse by default, if not specified
# 0.35384585061028506
eval_rmse = RegressionEvaluator(metricName="rmse")
eval_r2 = RegressionEvaluator(metricName="r2")
eval_rmse.evaluate(cvModel.transform(test)) # same as above
# 0.35384585061028506
eval_r2.evaluate(cvModel.transform(test))
# -0.001655087952929124
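The slightly negative r2 may look odd, so here is a small pure-Python sketch of the two metrics (my own helper functions, not Spark's implementation), assuming the usual definitions: RMSE is the square root of the mean squared error, and r2 is 1 - SS_res / SS_tot, where SS_tot is measured around the mean of the labels being evaluated. It uses the four test labels from above and the RMSE value reported by the evaluator:

```python
import math


def rmse(labels, predictions):
    """Root mean squared error."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(labels))


def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


test_labels = [0.2, 1.1, 0.9, 1.0]  # labels of the test DataFrame above

# Baseline: always predict the mean of the test labels.
mean_label = sum(test_labels) / len(test_labels)
baseline = [mean_label] * len(test_labels)
baseline_rmse = rmse(test_labels, baseline)
baseline_r2 = r2(test_labels, baseline)

# Recover r2 from the RMSE reported above: SS_res = n * RMSE^2.
reported_rmse = 0.35384585061028506
ss_tot = sum((y - mean_label) ** 2 for y in test_labels)
r2_from_rmse = 1.0 - len(test_labels) * reported_rmse ** 2 / ss_tot

print(baseline_rmse, baseline_r2, r2_from_rmse)
```

Under these assumptions r2_from_rmse comes out at roughly -0.0017, in line with the r2 shown above: on this tiny test set the model does marginally worse than always predicting the mean test label, which by construction has an r2 of 0.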