Load Pyspark.ml model from S3 using Pipeline
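I am trying to save a trained model to S3 storage and then load it and run predictions with it using the Pipeline package from pyspark.ml.
Here is an example of how I save the model.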
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
# stage_1 to stage_4 are some basic transformations on the data (one-hot encoding etc.)
# define stage 5: logistic regression model
stage_5 = LogisticRegression(featuresCol='features', labelCol='label')
# set up the pipeline
regression_pipeline = Pipeline(stages=[stage_1, stage_2, stage_3, stage_4, stage_5])
# fit the pipeline on the training data; fit() returns a PipelineModel
model = regression_pipeline.fit(dataFrame1)
model_path = "s3://s3-dummy_path-orch/dummy models/pipeline_testing_1.model"
model.save(model_path)
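The model saves successfully, and two folders are created under the model path above:
- stages
- metadata
However, when I try to load the model, I get the following error.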
Traceback (most recent call last):
File "/tmp/pythonScript_85ff2462_e087_4805_9f50_0c75fc4302e2958379757178872310.py", line 75, in <module>
pipelineModel = Pipeline.load(model_path)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 362, in load
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 207, in load
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 300, in load
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 79, in deco
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.Pipeline but found class name org.apache.spark.ml.PipelineModel'
I am trying to load the model as follows:
from pyspark.ml import Pipeline
# same path that was passed to model.save in the snippet above
model_path ="s3://s3-dummy_path-orch/dummy models/pipeline_testing_1.model"
pipelineModel = Pipeline.load(model_path)
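How can I fix this?
If you saved a pipeline model, you should load it as a PipelineModel, not as a Pipeline. The difference is that a PipelineModel has already been fitted to a DataFrame (it is what Pipeline.fit returns), whereas a Pipeline has not. The error message says the same thing: the loader expected the class org.apache.spark.ml.Pipeline, but the saved metadata names org.apache.spark.ml.PipelineModel.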
from pyspark.ml import PipelineModel
pipelineModel = PipelineModel.load(model_path)
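To see why the two loaders are not interchangeable, here is a minimal sketch (not Spark's own code). It assumes the first line of the files under the metadata folder is a JSON object whose "class" field holds the JVM class name, which is what the error message above reports. The helper names loader_name and load_saved and the sample JSON string are hypothetical, made up for illustration.

```python
import json

# Maps the JVM class name recorded in the saved metadata to the
# matching Python class name in pyspark.ml.
_LOADERS = {
    "org.apache.spark.ml.Pipeline": "Pipeline",
    "org.apache.spark.ml.PipelineModel": "PipelineModel",
}


def loader_name(metadata_line):
    """Return the pyspark.ml class name that matches a metadata JSON line."""
    java_class = json.loads(metadata_line)["class"]
    try:
        return _LOADERS[java_class]
    except KeyError:
        raise ValueError("unsupported class in metadata: " + java_class) from None


def load_saved(path, metadata_line):
    """Load `path` with whichever class the metadata names.

    Needs pyspark and an active Spark session; the import is done lazily
    so that loader_name() can be used without Spark installed.
    """
    import pyspark.ml
    return getattr(pyspark.ml, loader_name(metadata_line)).load(path)


# Hypothetical, abbreviated metadata line for a fitted pipeline.
sample = '{"class": "org.apache.spark.ml.PipelineModel", "uid": "PipelineModel_example"}'
print(loader_name(sample))  # prints PipelineModel
```

With a live session, the metadata line could be read with spark.sparkContext.textFile(model_path + "/metadata").first() and passed to load_saved. In practice, if you know you saved a fitted model, calling PipelineModel.load(model_path) directly is enough.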