AttributeError: 'PipelineModel' object has no attribute 'fitMultiple'

Question

I am trying to tune a random forest model in pyspark using CrossValidator and BinaryClassificationEvaluator, but I get an error when I do. Here is my code.

from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline

# Create a spark RandomForestClassifier using all default parameters.
# Create a training, and testing df
training_df, testing_df = raw_data_df.randomSplit([0.6, 0.4])

# build a pipeline for analysis
va = VectorAssembler().setInputCols(training_df.columns[0:110:]).setOutputCol('features')

# featuresCol="features"
rf = RandomForestClassifier(labelCol="quality")

# Train the model and calculate the AUC using a BinaryClassificationEvaluator
rf_pipeline = Pipeline(stages=[va, rf]).fit(training_df)

bce = BinaryClassificationEvaluator(labelCol="quality")

# Check AUC before tuning
bce.evaluate(rf_pipeline.transform(testing_df))


from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

paramGrid = ParamGridBuilder().build()

crossValidator = CrossValidator(estimator=rf_pipeline, 
                          estimatorParamMaps=paramGrid, 
                          evaluator=bce, 
                          numFolds=3)

model = crossValidator.fit(training_df)

It throws this error:

AttributeError: 'PipelineModel' object has no attribute 'fitMultiple'

How can I fix this?

Answer 1

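The estimator passed to CrossValidator must be a Pipeline object (an unfitted estimator), not a PipelineModel. Calling .fit() on a Pipeline returns a PipelineModel, and a PipelineModel has no fitMultiple method, which is what CrossValidator calls to refit the estimator on each fold.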

See this example for reference: https://github.com/apache/spark/blob/master/examples/src/main/python/ml/cross_validator.py
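
To make the estimator/model distinction concrete, here is a minimal pure-Python sketch. `ToyEstimator`, `ToyModel`, and `toy_cross_validate` are hypothetical stand-ins invented for illustration; they are not pyspark classes and only mimic the shape of the problem, assuming the cross-validator needs an object it can fit repeatedly.

```python
# Toy stand-ins only; these are NOT the pyspark classes.
class ToyModel:
    """Result of fitting: can transform data, but cannot be fitted again."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, values):
        # Center the values around the mean learned during fit().
        return [v - self.mean for v in values]


class ToyEstimator:
    """Unfitted recipe: every fit() call returns a new ToyModel."""

    def fit(self, values):
        return ToyModel(sum(values) / len(values))

    def fitMultiple(self, values, param_maps):
        # One fitted model per parameter map (the maps are ignored in this toy).
        return [(i, self.fit(values)) for i, _ in enumerate(param_maps)]


def toy_cross_validate(estimator, values, param_maps):
    # Like a cross-validator, this needs something it can fit repeatedly.
    return estimator.fitMultiple(values, param_maps)


estimator = ToyEstimator()
model = estimator.fit([1.0, 2.0, 3.0])

# Passing the unfitted estimator works.
fitted = toy_cross_validate(estimator, [1.0, 2.0, 3.0], [{}])

# Passing the already-fitted model fails the same way as in the question.
try:
    toy_cross_validate(model, [1.0, 2.0, 3.0], [{}])
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

The second call fails because the fitted object only knows how to transform. This mirrors passing `rf_pipeline` (a PipelineModel) where `Pipeline(stages=[va, rf])` is expected.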

Your code should be modified as follows.

Create the pipeline:

rf_pipe = Pipeline(stages=[va, rf])

Use that pipeline as the estimator in the cross-validator:

crossValidator = CrossValidator(estimator=rf_pipe, 
                          estimatorParamMaps=paramGrid, 
                          evaluator=bce, 
                          numFolds=3)

Overall:

....

# Train the model and calculate the AUC using a BinaryClassificationEvaluator
rf_pipe = Pipeline(stages=[va, rf])
rf_pipeline = rf_pipe.fit(training_df)

...

crossValidator = CrossValidator(estimator=rf_pipe, 
                          estimatorParamMaps=paramGrid, 
                          evaluator=bce, 
                          numFolds=3)

model = crossValidator.fit(training_df)
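
One related point, offered as a side note rather than as part of the fix: `ParamGridBuilder().build()` with no grids added yields a single empty parameter map, so the cross-validation above evaluates only the default settings. The sketch below imitates that expansion in plain Python. `expand_grid` is a hypothetical helper, not a pyspark function, and the `numTrees` and `maxDepth` values are illustrative assumptions.

```python
from itertools import product


def expand_grid(grid):
    """Return one dict per combination of the listed values (hypothetical helper)."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    # product() of zero lists yields one empty tuple, hence one empty dict.
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# No grids added: a single empty parameter map, so only defaults are tried.
empty = expand_grid({})

# Two values for each of two parameters: four candidate settings.
small = expand_grid({"numTrees": [20, 50], "maxDepth": [5, 10]})

print(len(empty), len(small))
```

In pyspark, the equivalent non-empty grid would be built with `ParamGridBuilder().addGrid(rf.numTrees, [20, 50]).addGrid(rf.maxDepth, [5, 10]).build()`, assuming `rf` is the RandomForestClassifier from the question.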


python

machine-learning

pyspark

apache-spark-mllib
