Machine learning with Spark, data preparation performance problem, MLeap

Question

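I have found a lot of good feedback about MLeap, a library that enables fast scoring. It works on a model that has been converted into an MLeap bundle.

But what about the data preparation stage that comes before scoring?

Is there an effective way to convert a 'Spark ML data preparation pipeline' (which works during training, but only inside the Spark framework) into robust, performant, optimized bytecode?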

Answer 1

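You can easily serialize your whole PipelineModel (containing both the feature engineering stages and the trained model) with MLeap, so the data preparation steps are exported together with the model.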
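To make "whole PipelineModel" concrete, here is a minimal sketch of a Spark ML pipeline whose fitted model carries its feature engineering stages with it. The toy data, the column names (`color`, `size`, `label`) and the choice of stages are assumptions made for illustration, not something taken from the question. It is written script-style, as it would be pasted into spark-shell.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

// Reuses the existing session in spark-shell, or starts a local one otherwise
val spark = SparkSession.builder().appName("pipeline-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical toy training data: one categorical and one numeric raw column
val trainData = Seq(
  ("red", 1.0, 0.0),
  ("blue", 2.5, 1.0),
  ("red", 0.3, 0.0),
  ("blue", 3.1, 1.0)
).toDF("color", "size", "label")

// Data preparation stages
val indexer = new StringIndexer().setInputCol("color").setOutputCol("colorIndex")
val assembler = new VectorAssembler()
  .setInputCols(Array("colorIndex", "size"))
  .setOutputCol("features")

// Model stage
val lr = new LogisticRegression().setFeaturesCol("features").setLabelCol("label")

// Fitting the Pipeline yields one PipelineModel holding all three fitted stages
val pipelineModel: PipelineModel =
  new Pipeline().setStages(Array(indexer, assembler, lr)).fit(trainData)

// The fitted model accepts raw columns and applies the preparation itself
pipelineModel.transform(trainData).select("color", "size", "prediction").show()
```

Because the indexer and assembler are stages of the same `PipelineModel`, exporting that one object exports the data preparation as well.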

Note: the MLeap code below is a bit old, and a cleaner API is probably available by now.

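// Training side (Spark): serialize the fitted PipelineModel with MLeap into a single .zip file
import ml.combust.bundle.BundleFile
import ml.combust.mleap.spark.SparkSupport._
import org.apache.spark.ml.bundle.SparkBundleContext
import resource._
// mleapSerializedPipelineModel is the path of the target .zip file
val sparkBundleContext = SparkBundleContext().withDataset(pipelineModel.transform(trainData))
for (bundleFile <- managed(BundleFile(s"jar:file:${mleapSerializedPipelineModel}"))) {
  pipelineModel.writeBundle.save(bundleFile)(sparkBundleContext).get
}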

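// Serving side (separate application, no Spark dependency): load the bundle from the local filesystem
import ml.combust.bundle.BundleFile
import ml.combust.mleap.runtime.MleapSupport._
import resource._
val mleapPipeline = (for (bf <- managed(BundleFile(s"jar:file:${modelPath}"))) yield {
  bf.loadMleapBundle().get.root
}).tried.get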
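Once the bundle is loaded, scoring raw records means putting them in a LeapFrame and calling `transform`. The preparation stages stored in the bundle run as part of that call. The following is a rough sketch, assuming a reasonably recent MLeap release (one where the frame classes live in `ml.combust.mleap.runtime.frame`). The column names `color` and `size` and the `score` helper are hypothetical; the schema has to match the raw input columns your own pipeline was trained on.

```scala
import ml.combust.mleap.core.types.{ScalarType, StructField, StructType}
import ml.combust.mleap.runtime.frame.{DefaultLeapFrame, Row, Transformer}

// Schema of the raw, unprepared input (hypothetical column names)
val schema = StructType(
  StructField("color", ScalarType.String),
  StructField("size", ScalarType.Double)
).get

// Raw records to score, held in memory without any Spark session
val frame = DefaultLeapFrame(schema, Seq(Row("red", 1.2), Row("blue", 2.8)))

// Hypothetical helper: runs every stage of the loaded pipeline and returns the output rows
def score(pipeline: Transformer, input: DefaultLeapFrame): Seq[Row] =
  pipeline.transform(input).get.dataset
```

With the `mleapPipeline` value loaded above, `score(mleapPipeline, frame).foreach(println)` would print one row per input record, including the columns added by each stage.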

Note that the tricky part is when you define your own Estimators/Transformers in Spark, because each of them needs a corresponding MLeap version as well.


performance

scoring

machine-learning

apache-spark

apache-spark-mllib