Tuning Spark for "Excessive" Parallelism on EMR

I have a Spark job that reads several TB of data and performs two window functions. The job runs fine on smaller chunks, 50k shuffle partitions over 4 TB, but it starts to fail when I scale the input up to 15 TB with 150k-200k shuffle partitions.

This happens for two reasons:

OOM on the executors

20/07/01 15:58:14 ERROR YarnClusterScheduler: Lost executor 92 on ip-10-102-125-133.ec2.internal: Container killed by YARN for exceeding memory limits.  22.0 GB of 22 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.

I have increased the driver size to cope with the large shuffle:
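
For illustration only, something of this shape in spark-defaults.conf terms (the values below are placeholder assumptions, not the exact settings used):

# Illustrative placeholders, not the original values
spark.driver.memory         24g
spark.driver.maxResultSize  8g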

The executors are R5.xlarge instances with the following configuration:
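
Again purely illustrative (these numbers are assumptions, chosen only to be consistent with the 22 GB container limit in the log above):

# Assumed values, chosen only to match the 22 GB container limit above
spark.executor.memory               20g
spark.executor.cores                4
spark.yarn.executor.memoryOverhead  2g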

This is well below AWS's maximums: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-task-config.html#emr-hadoop-task-config-r5

I understand that I need to tune spark.yarn.executor.memoryOverheadFactor here to leave room for the large overhead associated with this many partitions. Hopefully that will be the last change needed.
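
A sketch of what that adjustment might look like. One caveat: a fractional memoryOverheadFactor setting only exists in newer Spark releases; on Spark 2.x the equivalent knob is the absolute spark.yarn.executor.memoryOverhead, and raising it may require shrinking spark.executor.memory to stay under the YARN container limit:

# Hypothetical value; shrink spark.executor.memory correspondingly if needed
spark.yarn.executor.memoryOverhead  4g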

Shuffle timeouts

20/07/01 15:59:39 ERROR TransportChannelHandler: Connection to ip-10-102-116-184.ec2.internal/10.102.116.184:7337 has been quiet for 600000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
20/07/01 15:59:39 ERROR TransportResponseHandler: Still have 8 requests outstanding when connection from ip-10-102-116-184.ec2.internal/10.102.116.184:7337 is closed
20/07/01 15:59:39 ERROR OneForOneBlockFetcher: Failed while starting block fetches

I have adjusted this timeout as follows:
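
Something along these lines (600s is an assumed value that happens to match the 600000 ms quiet period in the log above):

# Assumed value matching the 600000 ms quiet period in the log
spark.network.timeout  600s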

I could increase spark.network.timeout further in the conf to quiet this down and simply wait longer. I would rather reduce the Shuffle Read Blocked Time, which ranges anywhere from 1 minute to 30 minutes. Is there a way to improve the communication rate between nodes?

I have tried adjusting the following settings, but nothing seems to improve that rate:
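
The usual candidates in this area are the shuffle I/O knobs; purely as hypothetical examples (these are not necessarily the settings that were tried, and the values are guesses):

# Hypothetical candidates and values, not necessarily what was tried
spark.reducer.maxSizeInFlight           96m
spark.shuffle.io.numConnectionsPerPeer  2
spark.shuffle.io.maxRetries             10
spark.shuffle.io.retryWait              30s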

What do I need to tune to reduce the Shuffle Read Blocked Time on AWS EMR?

For the OOM on the executors, do this; it fixed the problem for us. From: https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/

Even if all the Spark configuration properties are calculated and set correctly, virtual out-of-memory errors can still occur rarely as virtual memory is bumped up aggressively by the OS. To prevent these application failures, set the following flags in the YARN site settings.

Best practice 5: Always set the virtual and physical memory check flag to false.

"yarn.nodemanager.vmem-check-enabled":"false",
"yarn.nodemanager.pmem-check-enabled":"false"

原因:"Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used" on an EMR cluster with 75GB of memory

To fix the shuffle timeouts, try adding storage (larger or additional EBS volumes); shuffle files are written to those local volumes, so more aggregate disk throughput directly helps shuffle reads.
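
A sketch of what that might look like at cluster creation (instance type, count, and volume sizes here are placeholders, and other required create-cluster arguments are omitted):

# Placeholders throughout; other required create-cluster arguments omitted
aws emr create-cluster \
  --instance-groups '[{
    "InstanceGroupType": "CORE",
    "InstanceType": "r5.xlarge",
    "InstanceCount": 10,
    "EbsConfiguration": {
      "EbsBlockDeviceConfigs": [{
        "VolumeSpecification": {"VolumeType": "gp2", "SizeInGB": 500},
        "VolumesPerInstance": 2
      }]
    }
  }]'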