Pyspark error: "Field rawPrediction does not exist" when using cross validation
I have been trying to use CrossValidator on my training data, but I always get this error:
"An error occurred while calling o80267.evaluate.
: java.lang.IllegalArgumentException: Field "rawPrediction" does not exist.
Available fields: label, features, CrossValidator_6a7bb791f63f_rand, features_scaled, prediction"
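Here is the code:
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import StandardScaler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

df = spark.createDataFrame(input_data, ["label", "features"])
train_data, test_data = df.randomSplit([.8, .2], seed=1234)
train_data.show()

standardScaler = StandardScaler(inputCol="features", outputCol="features_scaled")
lr = LinearRegression(maxIter=10)
pipeline = Pipeline(stages=[standardScaler, lr])

# Grid of LinearRegression hyperparameters to search over
paramGrid = ParamGridBuilder()\
    .addGrid(lr.regParam, [0.3, 0.1, 0.01])\
    .addGrid(lr.fitIntercept, [False, True])\
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 0.8, 1.0])\
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)
cvModel = crossval.fit(train_data)
The call to train_data.show() prints the following: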
+-----+--------------------+
|label| features|
+-----+--------------------+
|4.526|[129.0,322.0,126....|
|3.585|[1106.0,2401.0,11...|
|3.521|[190.0,496.0,177....|
|3.413|[235.0,558.0,219....|
|3.422|[280.0,565.0,259....|
|2.697|[213.0,413.0,193....|
|2.992|[489.0,1094.0,514...|
|2.414|[687.0,1157.0,647...|
|2.267|[665.0,1206.0,595...|
|2.611|[707.0,1551.0,714...|
|2.815|[434.0,910.0,402....|
|2.418|[752.0,1504.0,734...|
|2.135|[474.0,1098.0,468...|
|1.913|[191.0,345.0,174....|
|1.592|[626.0,1212.0,620...|
| 1.4|[283.0,697.0,264....|
|1.525|[347.0,793.0,331....|
|1.555|[293.0,648.0,303....|
|1.587|[455.0,990.0,419....|
|1.629|[298.0,690.0,275....|
+-----+--------------------+
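I searched for rawPrediction, and as far as I understand it, this column is only added after the test-data DataFrame has been transformed. So what am I doing wrong here, and why does this error occur? Did I name one of the columns incorrectly? I also tried renaming scaled_features to features, but that obviously did not help.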
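You are using BinaryClassificationEvaluator on a regression problem, which is a mistake. The rawPrediction column is produced only by classification models, not by regression models, so your evaluator looks for a rawPrediction column, does not find it, and raises an error.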
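To make the mismatch concrete, here is a minimal plain-Python sketch (no Spark needed) that compares the fields listed in the error message with the columns each evaluator reads when left on its defaults. The `missing_columns` helper and the `DEFAULT_INPUT_COLUMNS` table are hypothetical names made up for this illustration and are not part of PySpark.

```python
# Columns each evaluator reads when left on its default parameters
# (rawPredictionCol / predictionCol / labelCol in pyspark.ml.evaluation).
DEFAULT_INPUT_COLUMNS = {
    "BinaryClassificationEvaluator": ["rawPrediction", "label"],
    "RegressionEvaluator": ["prediction", "label"],
}

# Fields reported as available in the error message from the question.
available_fields = [
    "label",
    "features",
    "CrossValidator_6a7bb791f63f_rand",
    "features_scaled",
    "prediction",
]


def missing_columns(evaluator_name, fields):
    """Hypothetical helper: list the default input columns absent from fields."""
    required = DEFAULT_INPUT_COLUMNS[evaluator_name]
    return [column for column in required if column not in fields]


binary_missing = missing_columns("BinaryClassificationEvaluator", available_fields)
regression_missing = missing_columns("RegressionEvaluator", available_fields)

print(binary_missing)      # ['rawPrediction']
print(regression_missing)  # []
```

The regression evaluator finds everything it needs in the DataFrame that LinearRegression produced, while the binary classification evaluator does not.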
Change the cross-validator as follows:
from pyspark.ml.evaluation import RegressionEvaluator

# RegressionEvaluator reads the "prediction" column, which LinearRegression produces
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator(),
                          numFolds=2)
That should fix it.