AWS Glue executor memory limit

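I found that AWS Glue sets up executor instances with a memory limit of 5 GB (--conf spark.executor.memory=5g), and on large datasets it sometimes fails with java.lang.OutOfMemoryError. The same goes for the driver instance (--spark.driver.memory=5g). Is there any option to increase this value?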

You can override these parameters by editing the job and adding job parameters. The key and value I used are:

Key: --conf

Value: spark.yarn.executor.memoryOverhead=7g

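This seems counterintuitive, since the name of the setting actually goes in the value, but it was recognized. So if you are trying to set spark.executor.memory, the following parameter would be appropriate:

Key: --conf

Value: spark.executor.memory=7g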
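To make the key/value layout above concrete, here is a minimal sketch that builds the job-parameter mapping in Python. The helper name `glue_conf_argument` is hypothetical and only illustrates the shape of the parameter; it does not call AWS.

```python
def glue_conf_argument(spark_property: str, value: str) -> dict:
    """Return a job-parameter mapping whose key is "--conf" and whose
    value carries the whole "property=value" assignment."""
    # Reject input that is clearly not a bare Spark property name.
    if not spark_property.startswith("spark.") or "=" in spark_property:
        raise ValueError(f"not a Spark property name: {spark_property!r}")
    return {"--conf": f"{spark_property}={value}"}


# The property and value used in the answer above.
job_arguments = glue_conf_argument("spark.yarn.executor.memoryOverhead", "7g")
print(job_arguments)
```

If you start runs programmatically, a mapping of this shape is what the `Arguments` parameter of the boto3 Glue client's `start_job_run` call expects. Whether Glue honors the override is the same open question as for the console setting.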

The official Glue documentation indicates that Glue does not support custom Spark configuration:

There are also several argument names used by AWS Glue internally that you should never set:

--conf — Internal to AWS Glue. Do not set!

--debug — Internal to AWS Glue. Do not set!

--mode — Internal to AWS Glue. Do not set!

--JOB_NAME — Internal to AWS Glue. Do not set!

Are there any better suggestions for solving this problem?

Open Glue > Jobs > Edit your job > Script libraries and job parameters (optional) > Job parameters (near the bottom).
Set the following: Key: --conf Value: spark.yarn.executor.memoryOverhead=1024 spark.driver.memory=10g

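I ran into out-of-memory errors like this when I had a highly skewed dataset. In my case, I had a bucket of JSON files containing dynamic payloads that differed based on the event type indicated in the JSON. I kept getting out-of-memory errors regardless of whether I used the configuration flags indicated here and increased the DPUs. It turned out that my events were heavily skewed toward a couple of event types, which accounted for more than 90% of the total dataset. Once I added a "salt" to the event types and broke up the highly skewed data, I no longer hit any out-of-memory errors.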
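The salting idea can be illustrated without a cluster. The following is a minimal pure-Python sketch, not the code from my job: the event names, record counts, number of partitions, and number of salt buckets are all made up, and `zlib.crc32` stands in for Spark's hash partitioner. It shows how appending a random suffix to a hot key spreads its records across partitions, and how a two-stage aggregation recovers the per-event totals.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8   # assumed number of shuffle partitions
SALT_BUCKETS = 16    # assumed number of salt values per key


def partition_for(key: str) -> int:
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def add_salt(event_type: str, rng: random.Random) -> str:
    """Append a random bucket number so one hot key becomes many keys."""
    return f"{event_type}#{rng.randrange(SALT_BUCKETS)}"


rng = random.Random(0)
# Hypothetical skewed input: one event type holds 90% of the records.
events = ["click"] * 9000 + ["purchase"] * 600 + ["signup"] * 400
salted_events = [add_salt(e, rng) for e in events]

# Records per partition, before and after salting.
unsalted_sizes = Counter(partition_for(k) for k in events)
salted_sizes = Counter(partition_for(k) for k in salted_events)
print("largest partition, unsalted:", max(unsalted_sizes.values()))
print("largest partition, salted:  ", max(salted_sizes.values()))

# Stage 1: aggregate per salted key. Stage 2: strip the salt and combine.
partial_counts = Counter(salted_events)
final_counts = Counter()
for salted_key, n in partial_counts.items():
    final_counts[salted_key.split("#", 1)[0]] += n
print(dict(final_counts))
```

In PySpark the same pattern is usually written by concatenating a random integer column onto the key before the wide operation and dropping it afterwards.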

Here is a blog post on AWS EMR that discusses the same out-of-memory error with highly skewed data: https://medium.com/thron-tech/optimising-spark-rdd-pipelines-679b41362a8a

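Although the AWS documentation states that the --conf parameter should not be passed, our AWS support team told us to pass --conf spark.driver.memory=10g, which corrected the issue we were having.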

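You can use the Glue G.1X and G.2X worker types, which provide more memory and disk space, to scale Glue jobs that need high memory and throughput. You can also edit the Glue job and set the --conf value to spark.yarn.executor.memoryOverhead=1024 (or 2048) and spark.driver.memory=10g.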
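If you start job runs from code rather than the console, the worker type can be chosen per run. The following is a rough sketch that only builds the request; the job name and worker count are placeholders and `job_run_request` is a hypothetical helper. The dictionary keys match the `JobName`, `WorkerType`, and `NumberOfWorkers` parameters of boto3's Glue `start_job_run`.

```python
# Only the worker types mentioned in this answer are accepted here.
LARGER_WORKER_TYPES = {"G.1X", "G.2X"}


def job_run_request(job_name: str, worker_type: str, number_of_workers: int) -> dict:
    """Build keyword arguments for a Glue job run on a larger worker type."""
    if worker_type not in LARGER_WORKER_TYPES:
        raise ValueError(f"unexpected worker type: {worker_type!r}")
    if number_of_workers < 1:
        raise ValueError("number_of_workers must be positive")
    return {
        "JobName": job_name,
        "WorkerType": worker_type,
        "NumberOfWorkers": number_of_workers,
    }


# "my-etl-job" and 10 workers are placeholder values.
request = job_run_request("my-etl-job", "G.2X", 10)
print(request)
```

With boto3 installed and credentials configured, the request could then be sent as `boto3.client("glue").start_job_run(**request)`.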

amazon-web-services

apache-spark

aws-glue