Issue in PySpark Cross Validation
I am trying to cross-validate an RF model in PySpark with the code below, but it throws an error:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
# Your code
trainData = raw_data_
numFolds = 5
rf = RandomForestClassifier(labelCol="Target", featuresCol="Scaled_features")
evaluator = MulticlassClassificationEvaluator()
pipeline = Pipeline(stages=[rf])
paramGrid = (ParamGridBuilder()
             .addGrid(rf.numTrees, [3, 10])
             .build())
crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=numFolds)
tr_model = crossval.fit(trainData)
But this results in an error.
My raw_data_ variable is:
+--------------------+--------------+--------------------+------+
|            features|Position_Group|     Scaled_features|Target|
+--------------------+--------------+--------------------+------+
|[173.735992431640...|           FWD|[12.9261366722264...|     0|
|[188.975997924804...|           FWD|[14.0600087682323...|     0|
|[179.832000732421...|           FWD|[13.3796859647366...|     0|
|[155.752807617187...|           MID|[11.5881692110224...|     2|
|[176.783996582031...|           FWD|[13.1529113184815...|     0|
|[176.783996582031...|           MID|[13.1529113184815...|     2|
|[182.880004882812...|           FWD|[13.6064606109917...|     0|
|[182.880004882812...|           DEF|[13.6064606109917...|     1|
|[182.880004882812...|           FWD|[13.6064606109917...|     0|
|[182.880004882812...|           MID|[13.6064606109917...|     2|
|[188.975997924804...|           DEF|[14.0600087682323...|     1|
|[176.783996582031...|           MID|[13.1529113184815...|     2|
|[170.688003540039...|           MID|[12.6993631612409...|     2|
|[155.447998046875...|           FWD|[11.5654910652351...|     0|
|[188.975997924804...|           FWD|[14.0600087682323...|     0|
|[179.832000732421...|           MID|[13.3796859647366...|     2|
|[188.975997924804...|           MID|[14.0600087682323...|     2|
|[185.927993774414...|           FWD|[13.8332341219772...|     0|
|[176.783996582031...|           FWD|[13.1529113184815...|     0|
|[188.975997924804...|           DEF|[14.0600087682323...|     1|
+--------------------+--------------+--------------------+------+
Any suggestions on why and where the problem occurs, and how to fix it?
Thanks
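The error says
Error while calling evaluate. Field "label" does not exist.
This points to the evaluator. You did not specify a label column when you defined the evaluator, so it falls back to its default column name, "label", which does not exist in your DataFrame.
To fix this, specify the label column when you instantiate the evaluator, just as you did for the classifier. For example: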
evaluator = MulticlassClassificationEvaluator(labelCol="Target")