mleap AttributeError: 'Pipeline' object has no attribute 'serializeToBundle'
I'm having trouble executing the example code from the mleap repository. I would like to run the code as a script rather than in a Jupyter notebook (which is how the example is run). My script is as follows:
##################################################################################
# start a local spark session
# https://spark.apache.org/docs/0.9.0/python-programming-guide.html
##################################################################################
from pyspark import SparkContext, SparkConf
conf = SparkConf()
#set app name
conf.set("spark.app.name", "train classifier")
#Run Spark locally with as many worker threads as logical cores on your machine (cores X threads).
conf.set("spark.master", "local[*]")
#number of cores to use for the driver process (only in cluster mode)
conf.set("spark.driver.cores", "1")
#Limit of total size of serialized results of all partitions for each Spark action (e.g. collect)
conf.set("spark.driver.maxResultSize", "1g")
#Amount of memory to use for the driver process
conf.set("spark.driver.memory", "1g")
#Amount of memory to use per executor process (e.g. 2g, 8g).
conf.set("spark.executor.memory", "2g")
#pass configuration to the spark context object along with code dependencies
sc = SparkContext(conf=conf)
from pyspark.sql.session import SparkSession
spark = SparkSession(sc)
##################################################################################
import mleap.pyspark
# # Imports MLeap serialization functionality for PySpark
from mleap.pyspark.spark_support import SimpleSparkSerializer
# Import standard PySpark Transformers and packages
from pyspark.ml.feature import VectorAssembler, StandardScaler, OneHotEncoder, StringIndexer
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import Row
# Create a test data frame
l = [('Alice', 1), ('Bob', 2)]
rdd = sc.parallelize(l)
Person = Row('name', 'age')
person = rdd.map(lambda r: Person(*r))
df2 = spark.createDataFrame(person)
df2.collect()
# Build a very simple pipeline using two transformers
string_indexer = StringIndexer(inputCol='name', outputCol='name_string_index')
feature_assembler = VectorAssembler(
    inputCols=[string_indexer.getOutputCol()], outputCol="features")
feature_pipeline = [string_indexer, feature_assembler]
featurePipeline = Pipeline(stages=feature_pipeline)
featurePipeline.fit(df2)
featurePipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip")
When I execute spark-submit script.py, I get the following error:
17/09/18 13:26:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
File "/Users/opringle/Documents/Repos/finn/Magellan/src/no_spark_predict.py", line 58, in <module>
featurePipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip")
AttributeError: 'Pipeline' object has no attribute 'serializeToBundle'
Any help would be much appreciated! I have installed mleap from PyPI.
It looks like you haven't followed the steps correctly. The documentation at http://mleap-docs.combust.ml/getting-started/py-spark.html says:
Note: the import of mleap.pyspark needs to happen before any other PySpark libraries are imported.
So try importing your SparkContext after mleap.
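As a sketch of how the script in the question might be reordered accordingly (my own rearrangement, not a verbatim copy of the mleap example; the second argument to serializeToBundle follows the mleap docs example and may differ between mleap versions):

# mleap.pyspark must be imported before any other PySpark module so that it can
# attach serializeToBundle to the Spark ML classes
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer

# only now import the rest of PySpark
from pyspark import SparkContext, SparkConf
from pyspark.sql.session import SparkSession
from pyspark.sql import Row
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StringIndexer

conf = SparkConf()
conf.set("spark.app.name", "train classifier")
conf.set("spark.master", "local[*]")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

# same toy data frame as in the question
df2 = spark.createDataFrame([Row(name='Alice', age=1), Row(name='Bob', age=2)])

string_indexer = StringIndexer(inputCol='name', outputCol='name_string_index')
feature_assembler = VectorAssembler(
    inputCols=[string_indexer.getOutputCol()], outputCol="features")
featurePipeline = Pipeline(stages=[string_indexer, feature_assembler])

# fit() returns a PipelineModel; serialize the fitted model rather than the
# unfitted Pipeline definition (assumption based on the mleap docs example,
# where the transformed dataset is passed as a second argument)
fittedPipeline = featurePipeline.fit(df2)
fittedPipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip",
                                 fittedPipeline.transform(df2))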
I solved this by attaching the following jar package at run time:
spark-submit --packages ml.combust.mleap:mleap-spark_2.11:0.8.1 script.py
See Here.
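A hedged variation on the same idea, assuming you launch the script with plain python (so that PySpark itself spawns spark-submit and forwards the configuration) rather than via spark-submit, is to put the same Maven coordinates into the SparkConf before creating the context:

from pyspark import SparkContext, SparkConf

conf = SparkConf()
conf.set("spark.app.name", "train classifier")
conf.set("spark.master", "local[*]")
# same coordinates as the --packages flag above; Spark resolves them and adds
# the jars to the driver and executor classpaths when the context starts
conf.set("spark.jars.packages", "ml.combust.mleap:mleap-spark_2.11:0.8.1")
sc = SparkContext(conf=conf)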
It seems that MLeap is not ready for Spark 2.3 yet. If you happen to be running Spark 2.3, try downgrading to 2.2 and retrying. Hope that helps!
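If you are unsure which Spark version the script actually runs against, a quick check from inside the script (sc.version is a standard SparkContext attribute), or run spark-submit --version on the command line:

# quick sanity check on the Spark version the script is actually using
print(sc.version)   # e.g. '2.2.1' or '2.3.0'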