What's the difference between Spark ML and MLLIB packages

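I noticed that there are two LinearRegressionModel classes in Spark, one in the ML package (spark.ml) and another in the MLlib package (spark.mllib).

The two are implemented quite differently. For example, the one from MLlib implements Serializable, while the other one does not.

By the way, the same is true of RandomForestModel and Word2Vec.

Why are there two classes? Which one is the "right" one? And is there a way to convert one into the other?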

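o.a.s.mllib (org.apache.spark.mllib) contains the old RDD-based API, while o.a.s.ml contains the new API built around Dataset and ML Pipelines. ml and mllib reached feature parity in 2.0.0, and mllib is slowly being deprecated (this has already happened in the case of linear regression) and will most likely be removed in the next major release.

So unless your goal is backward compatibility, the "right choice" is o.a.s.ml.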

Spark MLlib

spark.mllib contains the legacy API built on top of RDDs.

Spark ML

spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.

According to the official announcement:

As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package. Apache Spark recommends using spark.ml.

  • MLlib will still support the RDD-based API in spark.mllib with bug fixes.

  • MLlib will not add new features to the RDD-based API.

  • In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.

  • After reaching feature parity (roughly estimated for Spark 2.3), the RDD-based API will be deprecated.

  • The RDD-based API is expected to be removed in Spark 3.0.

Why is MLlib switching to the DataFrame-based API?

  • DataFrames provide a more user-friendly API than RDDs. The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.

  • The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.

  • DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the Pipelines guide for details.

More information: Machine Learning Library (MLlib) Guide