Spark: implement parallel cross-validation for the Scala API
PySpark offers great possibilities to parallelize the cross-validation of models via https://github.com/databricks/spark-sklearn, as a simple drop-in replacement for sklearn's GridSearchCV with:

from spark_sklearn import GridSearchCV

How can I achieve similar functionality for Spark's Scala CrossValidator, i.e. parallelize each fold?
Since Spark 2.3:

You can do this with the setParallelism(n) method on CrossValidator, either after construction or at creation time. I.e.:

cv.setParallelism(2)

or

cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator,
                    parallelism=2)  # Evaluate up to 2 parameter settings in parallel
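Since the question asks about the Scala API, here is a minimal Scala sketch of the same thing; the LogisticRegression estimator, evaluator, parameter grid, and trainingData are illustrative placeholders, not from the original post:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setNumFolds(3)
  .setParallelism(2)  // train up to 2 models in parallel (Spark >= 2.3)

val cvModel = cv.fit(trainingData)  // trainingData: a DataFrame with features/label columns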
Before Spark 2.3:

You cannot do this; cross-validation cannot be parallelized in Spark Scala. If you read the documentation of spark-sklearn, GridSearchCV is parallelized but the model training is not, so it is useless at scale. Moreover, you cannot parallelize cross-validation for the Spark Scala API because of the famous SPARK-5063:
RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
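To see concretely why this rules out a do-it-yourself version, consider the naive workaround of distributing the parameter maps as an RDD and fitting inside the closure. This sketch (lr, trainingData, and sc are assumed placeholders) fails at runtime with exactly the SPARK-5063 error, because fit launches Spark jobs from inside a task:

// Naive attempt: does NOT work
val paramMaps = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

sc.parallelize(paramMaps).map { params =>
  lr.fit(trainingData, params)  // illegal: nested use of SparkContext inside a transformation
}.collect()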
From the README.md:
This package contains some tools to integrate the Spark computing framework with the popular scikit-learn machine learning library. Among other tools:
train and evaluate multiple scikit-learn models in parallel. It is a distributed analog to the multicore implementation included by default in scikit-learn.
convert Spark's Dataframes seamlessly into numpy ndarrays or sparse matrices.
(experimental) distribute Scipy's sparse matrices as a dataset of sparse vectors.
It focuses on problems that have a small amount of data and that can be run in parallel.
for small datasets, it distributes the search for estimator parameters (GridSearchCV in scikit-learn), using Spark,
for datasets that do not fit in memory, we recommend using the distributed implementation in Spark MLlib.
NOTE: This package distributes simple tasks like grid-search cross-validation. It does not distribute individual learning algorithms (unlike Spark MLlib).