What's the difference between Spark ML and MLlib packages?

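I noticed that there are two LinearRegressionModel classes in Spark ML: one in the ML package (spark.ml) and another in the MLlib package (spark.mllib).

The two are implemented quite differently. For example, the one from MLlib implements Serializable, while the other does not.

By the way, the same is true of RandomForestModel and Word2Vec.

Why are there two classes? Which one is the "right" one? And is there a way to convert one into the other?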

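o.a.s.mllib (org.apache.spark.mllib) contains the old RDD-based API, while o.a.s.ml (org.apache.spark.ml) contains the new API built around Dataset and ML Pipelines. ml and mllib reached feature parity in 2.0.0, and mllib is slowly being deprecated (this has already happened in the case of linear regression) and will most likely be removed in the next major release.

So unless your goal is backward compatibility, the "right choice" is o.a.s.ml.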
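To make the RDD versus Dataset distinction concrete, here is a minimal sketch that trains Word2Vec (one of the classes mentioned in the question) through both packages. The toy corpus, the local master setting, the object name, and the column names "text" and "features" are assumptions made for illustration, and the parameter values are arbitrary.

```scala
import org.apache.spark.ml.feature.{Word2Vec => MLWord2Vec}
import org.apache.spark.mllib.feature.{Word2Vec => MLlibWord2Vec}
import org.apache.spark.sql.SparkSession

object Word2VecTwoApis {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ml-vs-mllib")
      .getOrCreate()
    import spark.implicits._

    // Toy corpus, made up for this example.
    val sentences: Seq[Seq[String]] = Seq(
      "spark ml uses dataframes",
      "spark mllib uses rdds"
    ).map(_.split(" ").toSeq)

    // spark.mllib: the estimator is trained directly on an RDD of token sequences.
    val rdd = spark.sparkContext.parallelize(sentences)
    val mllibModel = new MLlibWord2Vec()
      .setVectorSize(3)
      .setMinCount(0)
      .fit(rdd)
    // getVectors returns a Map from each word to its embedding.
    println(mllibModel.getVectors.keys.toSeq.sorted.mkString(", "))

    // spark.ml: the estimator reads a named column of a DataFrame.
    val df = sentences.toDF("text")
    val mlModel = new MLWord2Vec()
      .setInputCol("text")
      .setOutputCol("features")
      .setVectorSize(3)
      .setMinCount(0)
      .fit(df)
    // transform() appends the "features" column to the input DataFrame.
    mlModel.transform(df).show(truncate = false)

    spark.stop()
  }
}
```

The two model classes share a name but not a type, which is why the import aliases are needed when both appear in one file.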

Spark MLlib

spark.mllib contains the legacy API built on top of RDDs.

Spark ML

spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.

According to the official announcement:

As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package. Apache Spark recommends using spark.ml.

MLlib will still support the RDD-based API in spark.mllib with bug fixes.

MLlib will not add new features to the RDD-based API.

In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.

After reaching feature parity (roughly estimated for Spark 2.3), the RDD-based API will be deprecated.

The RDD-based API is expected to be removed in Spark 3.0.

Why is MLlib switching to the DataFrame-based API?

DataFrames provide a more user-friendly API than RDDs. The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.

The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.

DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the Pipelines guide for details.
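As a rough illustration of the pipeline point, the sketch below chains two feature transformers and a classifier over a single DataFrame. It follows the general shape of the example in the Pipelines guide, but the training rows, labels, column names, object name, and parameter values are all invented here and carry no meaning beyond the demonstration.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pipeline-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up labelled text; the labels are arbitrary.
    val training = Seq(
      ("spark ml pipelines", 1.0),
      ("legacy rdd code", 0.0),
      ("dataframe based api", 1.0),
      ("rdd based api", 0.0)
    ).toDF("text", "label")

    // Each stage reads one column and writes another.
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("features")
      .setNumFeatures(1000)
    // LogisticRegression uses the default "features" and "label" columns.
    val lr = new LogisticRegression().setMaxIter(10)

    // The Pipeline is itself an estimator: fit() runs the stages in order.
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(training)

    // The fitted PipelineModel applies every stage in a single transform() call.
    model.transform(training).select("text", "prediction").show(truncate = false)

    spark.stop()
  }
}
```

Because every stage communicates through named DataFrame columns, stages can be swapped or reordered without changing the code that feeds data in, which is awkward to express with raw RDDs.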
