Why does Spark Streaming receiving data from Kafka use more memory than <executorMemory * executorCount + driverMemory>?

Question

I submitted a Spark Streaming application to a YARN cluster in client mode as follows:

./spark-submit \
--jars $JARS \
--class $APPCLS \
--master yarn-client \
--driver-memory 64m \
--executor-memory 64m \
--conf spark.shuffle.service.enabled=false \
--conf spark.dynamicAllocation.enabled=false  \
--num-executors 6 \
/data/apps/app.jar

executorMemory * executorCount + driverMemory = 64m*6 + 64m = 448m,

but the application actually used 3968 MB. Why does this happen, and how can I reduce the memory usage?

Answer 1

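The Spark configuration parameters spark.yarn.executor.memoryOverhead and spark.yarn.driver.memoryOverhead default to 384 MB in your case (docs). This overhead is added on top of the memory you request for each JVM.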

On top of that, the YARN memory allocation granularity (yarn.scheduler.increment-allocation-mb) defaults to 512 MB, so every container request is rounded up to a multiple of it.

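There is also a minimum allocation size (yarn.scheduler.minimum-allocation-mb) that defaults to 1 GB. Either it is set lower in your case, or you are not reading the memory allocation correctly.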
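To make the arithmetic concrete, here is a rough sketch of how a container size could be estimated from the settings above. The function name yarn_container_mb and its parameters are made up for illustration, and the Application Master figures (512 MB of memory plus 384 MB of overhead in client mode, counted without rounding) are an assumption on my part, not something taken from your cluster.

```python
def yarn_container_mb(heap_mb, overhead_mb=384, increment_mb=512, minimum_mb=0):
    """Estimate the YARN container size (MB) for one Spark JVM.

    Hypothetical helper: heap plus overhead, rounded up to the allocation
    increment, and never smaller than the minimum allocation.
    """
    requested = heap_mb + overhead_mb
    # Integer ceiling division, then back to a multiple of the increment.
    rounded = -(-requested // increment_mb) * increment_mb
    return max(rounded, minimum_mb)


# One executor: 64 MB + 384 MB overhead = 448 MB, rounded up to 512 MB.
executor_mb = yarn_container_mb(64)
executors_total_mb = 6 * executor_mb

# With the default 1 GB minimum, each executor alone would take 1024 MB.
executor_with_min_mb = yarn_container_mb(64, minimum_mb=1024)

# Assumption: the client-mode Application Master asks for 512 MB + 384 MB
# overhead, and that request is counted as-is (not rounded).
am_mb = 512 + 384
total_mb = executors_total_mb + am_mb

print(executor_mb, executors_total_mb, executor_with_min_mb, total_mb)
```

Under these assumptions the total comes out at 3968 MB, which matches the number you reported, while a 1 GB minimum would already give 6144 MB for the executors alone. That is consistent with the minimum being set lower on your cluster, but treat it as an estimate rather than a confirmed breakdown.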

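All of this overhead should be negligible compared to your actual memory use. You should set --executor-memory to something like 20 GB or more. Why are you trying to configure such a ridiculously small amount of memory?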

apache-spark

spark-streaming