Why increase spark.yarn.executor.memoryOverhead?

I am trying to join two large Spark dataframes and keep running into this error:

Container killed by YARN for exceeding memory limits. 24 GB of 22 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

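This seems to be a common issue among Spark users, but I can't find any solid description of what spark.yarn.executor.memoryOverhead is. In some cases it sounds like a kind of memory buffer before YARN kills the container (e.g. 10GB was requested, but YARN won't kill the container until it uses 10.2GB). In other cases it sounds like it is used for some kind of data accounting task that is completely separate from the analysis I want to perform. My questions are:

What is spark.yarn.executor.memoryOverhead used for?
What is the benefit of increasing this kind of memory instead of executor memory (or the number of executors)?
In general, are there steps I can take to reduce my spark.yarn.executor.memoryOverhead usage (e.g. particular data structures, limiting the width of the dataframes, using fewer executors with more memory, etc.)?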

The overhead option is explained well in the configuration document:

This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size (typically 6-10%).
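To make the arithmetic concrete, here is a rough sketch of how the container size YARN enforces relates to the executor heap and the overhead. It assumes the commonly documented default of 10% of executor memory with a 384 MiB floor (older releases used a smaller fraction, which fits the 6-10% range quoted above). The function name and parameters are made up for illustration, and YARN's rounding up to its allocation increment is not modeled.

```python
MIN_OVERHEAD_MIB = 384  # assumed floor applied to the default overhead


def yarn_container_mib(executor_memory_mib, overhead_mib=None, overhead_percent=10):
    """Approximate memory limit (MiB) YARN enforces for one executor container.

    Hypothetical helper: heap size plus overhead, where the overhead defaults
    to a percentage of the heap with a fixed minimum.
    """
    if overhead_mib is None:
        overhead_mib = max(MIN_OVERHEAD_MIB, executor_memory_mib * overhead_percent // 100)
    return executor_memory_mib + overhead_mib


# A 20 GiB heap with the assumed default overhead gives a 22 GiB container.
print(yarn_container_mib(20 * 1024))
# Raising the overhead to 4 GiB raises the limit without changing the heap.
print(yarn_container_mib(20 * 1024, overhead_mib=4096))
```

Under these assumptions a 20 GiB heap would produce a 22 GiB limit like the one in the error message, although the question does not state the settings that were actually used. The point is that off-heap usage counts against the container limit but not against the heap, so growing the heap alone does not buy more room for it.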

This also includes user objects if you use one of the non-JVM guest languages (Python, R, etc.).
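As a sketch of how the setting could be raised for such a job, the snippet below only assembles a spark-submit command line and does not launch anything. The script name join_job.py and the 20g / 4096 values are placeholders rather than recommendations. The property takes a value in MiB, and newer Spark releases (2.3 and later) name it spark.executor.memoryOverhead.

```python
def spark_submit_args(app, executor_memory="20g", overhead_mib=4096):
    """Build a spark-submit argument list with an explicit memory overhead.

    Hypothetical helper for illustration; the values are placeholders.
    """
    return [
        "spark-submit",
        "--master", "yarn",
        "--executor-memory", executor_memory,
        # Off-heap allowance per executor, in MiB, added on top of the heap.
        "--conf", f"spark.yarn.executor.memoryOverhead={overhead_mib}",
        app,
    ]


# join_job.py stands in for the actual application script.
print(" ".join(spark_submit_args("join_job.py")))
```

For a PySpark or SparkR job the Python or R worker processes live outside the JVM heap, so this is the setting that gives them room.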


hadoop-yarn

apache-spark