Are there any spark configuration parameters that one can tune in order to decrease the driver node memory consumption?

hadoop-yarn
apache-spark
pyspark

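I am running a distributed hyperparameter random search (RandomizedSearchCV) on a YARN cluster using pyspark, scikit-learn, and joblibspark. The memory consumption of the driver node appears to be roughly equal to the sum of the memory consumption of all worker nodes. Since the memory available per node is limited, the driver node reaches this limit very quickly.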

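Eventually, I found that the joblibspark library is a poor fit for this job, especially when the feature matrix is large in memory. So I implemented the random search for scikit-learn models from scratch using native pyspark functionality, so that the entire result is not collected on the driver node at the end of the computation. I found pandas UDFs in pyspark particularly useful.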
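A minimal sketch of what such a from-scratch random search might look like is given here. It is not the original implementation. It assumes Spark 3.0 or later and uses the grouped pandas function API (`applyInPandas`). The helper names (`load_training_data`, `sample_candidates`, `evaluate_candidates`, `random_search_on_spark`), the logistic regression model, and the synthetic data are all hypothetical stand-ins. The idea is that each task returns one small row per candidate, so the driver never receives fitted models or the feature matrix.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def load_training_data():
    # Hypothetical loader: stands in for reading the real feature matrix
    # from storage that every executor can reach (for example HDFS).
    return make_classification(n_samples=200, n_features=10, random_state=0)


def sample_candidates(n_iter, seed=0):
    # Draw random hyperparameter candidates on the driver; this table is tiny.
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "candidate_id": np.arange(n_iter, dtype="int64"),
        "C": 10.0 ** rng.uniform(-3, 2, size=n_iter),
    })


def evaluate_candidates(pdf: pd.DataFrame) -> pd.DataFrame:
    # Runs on an executor: cross-validate each candidate in this group and
    # return only its id, its parameter, and its mean score.
    X, y = load_training_data()
    scores = []
    for c in pdf["C"]:
        model = LogisticRegression(C=float(c), max_iter=1000)
        scores.append(float(cross_val_score(model, X, y, cv=3).mean()))
    return pd.DataFrame({
        "candidate_id": pdf["candidate_id"].to_numpy(dtype="int64"),
        "C": pdf["C"].to_numpy(dtype="float64"),
        "mean_cv_score": scores,
    })


def random_search_on_spark(spark, n_iter=20):
    # `spark` is an existing SparkSession. One group per candidate lets
    # Spark schedule the evaluations across the executors.
    candidates = spark.createDataFrame(sample_candidates(n_iter))
    results = candidates.groupBy("candidate_id").applyInPandas(
        evaluate_candidates,
        schema="candidate_id long, C double, mean_cv_score double",
    )
    # Only the single best row is brought back to the driver.
    return results.orderBy("mean_cv_score", ascending=False).limit(1).collect()
```

With a running session, `random_search_on_spark(spark, n_iter=50)` would return a one-element list holding the best candidate. The best model can then be refit separately, which keeps the driver-side result independent of the size of the feature matrix.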
