Spark job timing out when trying to save as table on AWS EMR

Question

We have set up a dedicated cluster on AWS for our application.

This is the configuration of the core nodes (we have 2 core nodes):

m5.xlarge
4 vCore, 16 GiB memory, EBS only storage
EBS Storage: 64 GiB

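Current dataset -

We are trying to run a Spark job that involves many joins and processes 80 million records, each with more than 60 fields.

The problem we are facing -

When we try to save the final DataFrame as an Athena table, the job runs for more than 1 hour and then times out.

Since we are the only ones using the cluster, how should we configure it to make sure we use all of the cluster resources optimally?

Current configuration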

Executor Memory : 2G
Dynamic Allocation Enabled : true
Number of Executor Cores : 1
Number of Executors : 8
spark.dynamicAllocation.executorIdleTimeout : 3600
spark.sql.broadcastTimeout : 36000

Answer 1

Looking at your configuration -

You are using

m5.xlarge, which has 4 vCores and 16 GiB of memory

Executor configuration

Number of Executor Cores : 1
Executor Memory : 2G

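So an m5.xlarge node can launch at most 4 executors (one per vCore), and those 4 executors need only 4 x 2G = 8G of memory, about half of the node's 16 GiB. In the end you are not utilizing all of the resources.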
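
The arithmetic above can be checked with a small sketch. The function name `executor_footprint` is made up for illustration, and the calculation is deliberately rough: it ignores executor memory overhead and whatever YARN reserves on the node, so treat the result as an upper bound, not as what EMR will actually schedule.

```python
def executor_footprint(vcores_per_node, mem_gib_per_node,
                       executor_cores, executor_mem_gib):
    """Rough per-node estimate (hypothetical helper, not a Spark API).

    Returns a tuple of:
      - how many executors fit on one node when limited by vCores,
      - the heap memory (GiB) those executors request in total,
      - the fraction of the node's memory that request represents.
    Memory overhead and YARN/OS reservations are NOT modelled.
    """
    executors_per_node = vcores_per_node // executor_cores
    requested_mem_gib = executors_per_node * executor_mem_gib
    return (executors_per_node,
            requested_mem_gib,
            requested_mem_gib / mem_gib_per_node)


# Values from the question: m5.xlarge (4 vCores, 16 GiB),
# 1 core and 2G of memory per executor.
per_node, mem_gib, fraction = executor_footprint(4, 16, 1, 2)
print(per_node, mem_gib, fraction)  # 4 8 0.5
```

Raising the executor memory so that the four executors together come closer to the memory the node can actually hand out is one way to use the idle half; the exact value depends on the overhead settings of your cluster.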

Also, as @Shadowtrooper said, save the data partitioned if you can (and in Parquet format if possible); this will also reduce cost when querying it in Athena.
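
A minimal PySpark sketch of a partitioned Parquet write might look like the following. The helper name `save_partitioned_parquet`, the table name, the partition column, and the S3 path are all placeholders that are not in the question, and the sketch assumes the cluster's metastore is one that Athena can read (for example the AWS Glue Data Catalog).

```python
def save_partitioned_parquet(df, table_name, partition_cols, path):
    """Write a Spark DataFrame as a partitioned Parquet table.

    df             -- the final pyspark.sql.DataFrame to persist
    table_name     -- metastore table name (hypothetical, e.g. "mydb.final_table")
    partition_cols -- list of column names to partition by (hypothetical)
    path           -- storage location for the data files (hypothetical S3 path)
    """
    (
        df.write
        .mode("overwrite")               # replace any existing table contents
        .format("parquet")               # columnar format, cheaper to scan in Athena
        .partitionBy(*partition_cols)    # one directory per partition value
        .option("path", path)            # keep the data at an explicit location
        .saveAsTable(table_name)         # register the table in the metastore
    )


# Example call with placeholder names (assumes final_df already exists):
# save_partitioned_parquet(
#     final_df, "mydb.final_table", ["event_date"],
#     "s3://example-bucket/final_table/",
# )
```

Pick a partition column with a moderate number of distinct values; partitioning by a very high-cardinality column tends to produce many small files, which can slow down both the write and later Athena queries.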

amazon-web-services

amazon-emr

apache-spark

pyspark