Spark JobServer, memory settings for release

Question

I have set up a spark-jobserver to enable complex queries on a reduced dataset.

The job server performs two operations:

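Sync with the main remote database: it dumps some of the server's tables, reduces and aggregates the data, saves the result as a parquet file, and caches it in memory as a SQL table. This operation runs once a day;
Queries: once the sync operation has finished, users can run complex SQL queries on the aggregated dataset and (eventually) export the results as a csv file. Each user can run only one query at a time and must wait for it to complete.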

The largest table (both before and after the reduction, which also includes some joins) has almost 30M rows and at least 30 fields.

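At the moment I am working on a dev machine with 32GB of RAM dedicated to the job server, and everything runs smoothly. The problem is that in production we share the same amount of RAM with a PredictionIO server.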

I am asking how to determine the memory configuration so as to avoid memory leaks or Spark crashes.

I am new to this, so any reference or suggestion is welcome.

Thanks

Answer 1

For example, if you have a server with 32g of RAM, set the following parameter:

 spark.executor.memory = 32g

Take note of the following:

The likely first impulse would be to use --num-executors 6 --executor-cores 15 --executor-memory 63G. However, this is the wrong approach because:

63GB + the executor memory overhead won’t fit within the 63GB capacity of the NodeManagers.
The application master will take up a core on one of the nodes, meaning that there won’t be room for a 15-core executor on that node.
15 cores per executor can lead to bad HDFS I/O throughput.

A better option would be to use --num-executors 17 --executor-cores 5 --executor-memory 19G. Why?

This config results in three executors on all nodes except for the one with the AM, which will have two executors.
--executor-memory was derived as (63/3 executors per node) = 21. 21 * 0.07 = 1.47. 21 - 1.47 ~ 19.
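
The arithmetic in the quoted passage can be reproduced with a short Python sketch. The cluster shape (6 nodes, each offering 15 cores and 63GB to YARN) is an assumption inferred from the numbers in the quote. The function name size_executors and its parameters are hypothetical, and the 0.07 overhead factor is the one the quote uses.

```python
import math


def size_executors(nodes, cores_per_node, mem_gb_per_node,
                   cores_per_executor=5, overhead_fraction=0.07):
    """Return (num_executors, executor_cores, executor_memory_gb).

    Hypothetical helper that mirrors the arithmetic in the quoted passage.
    """
    # Executors that fit on one node, limited by the available cores.
    executors_per_node = cores_per_node // cores_per_executor
    # One executor slot is given up to make room for the application master.
    num_executors = nodes * executors_per_node - 1
    # Memory slot per executor, minus the overhead share, rounded down.
    slot_gb = mem_gb_per_node / executors_per_node
    executor_memory_gb = math.floor(slot_gb - slot_gb * overhead_fraction)
    return num_executors, cores_per_executor, executor_memory_gb


# Assumed cluster: 6 nodes, 15 usable cores and 63GB usable memory per node.
num, cores, mem = size_executors(nodes=6, cores_per_node=15, mem_gb_per_node=63)
print(f"--num-executors {num} --executor-cores {cores} --executor-memory {mem}G")
# prints: --num-executors 17 --executor-cores 5 --executor-memory 19G
```

Changing the inputs to match your own nodes gives a first estimate only; the real limits depend on how the cluster manager is configured.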

If you want to learn more, there is an explanation here: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
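
For the shared production machine described in the question, the same overhead reasoning suggests leaving headroom rather than giving Spark all 32g. The following is only a rough sketch: the helper executor_memory_setting is hypothetical, and the 12GB reserved for PredictionIO and 2GB reserved for the operating system are made-up placeholders that you would replace with measured values. The 0.07 overhead factor is taken from the passage quoted above.

```python
import math


def executor_memory_setting(total_ram_gb, other_services_gb, os_reserve_gb,
                            overhead_fraction=0.07):
    """Return a spark.executor.memory line for a host shared with other services.

    Hypothetical helper; all reservation figures are supplied by the caller.
    """
    # Memory left for Spark once the other tenants of the host are accounted for.
    available_gb = total_ram_gb - other_services_gb - os_reserve_gb
    if available_gb <= 0:
        raise ValueError("no memory left for Spark after reservations")
    # Keep room for the executor memory overhead, as in the quoted passage.
    heap_gb = math.floor(available_gb - available_gb * overhead_fraction)
    if heap_gb < 1:
        raise ValueError("less than 1GB of heap left for Spark")
    return f"spark.executor.memory = {heap_gb}g"


# Placeholder reservations: 12GB for PredictionIO, 2GB for the OS.
print(executor_memory_setting(total_ram_gb=32, other_services_gb=12, os_reserve_gb=2))
# prints: spark.executor.memory = 16g
```

Whether the resulting value is enough for the 30M-row cached table can only be confirmed by running the daily sync and a few representative queries while watching actual memory use.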


memory

apache-spark

spark-jobserver