What's the difference between the Spark ML and MLlib packages?
I noticed there are two LinearRegressionModel classes in Spark ML: one in the ML package (spark.ml) and another in the MLlib package (spark.mllib).
The two are implemented quite differently. For example, the one from MLlib implements Serializable, while the other does not.
By the way, the same is true of RandomForestModel and Word2Vec.
Why are there two classes? Which one is the "right" one? And is there a way to convert one into the other?
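o.a.s.mllib contains the old RDD-based API, while o.a.s.ml contains the new API built around Dataset and ML Pipelines. ml and mllib reached feature parity in 2.0.0, and mllib is slowly being deprecated (this has already happened for linear regression) and will most likely be removed in the next major release.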
So unless your goal is backward compatibility, the "right choice" is o.a.s.ml.
Spark MLlib
spark.mllib contains the legacy API built on top of RDDs.
Spark ML
spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.
As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package. Apache Spark recommends using spark.ml.
MLlib will still support the RDD-based API in spark.mllib with bug fixes.
MLlib will not add new features to the RDD-based API.
In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.
After reaching feature parity (roughly estimated for Spark 2.3), the RDD-based API will be deprecated.
The RDD-based API is expected to be removed in Spark 3.0.
Why is MLlib switching to the DataFrame-based API?
DataFrames provide a more user-friendly API than RDDs. The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.
The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.
DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the Pipelines guide for details.