How should one choose the number of executors and cores for a spark-submit job?

Question

I have a Spark Structured Streaming job that does the following:

Streams a file of JSON lines from an S3 folder (many lines, around 12 million)

Filters them to exclude a couple of million

Calls an external HTTP API with each JSON record (using concurrency)

Writes the response data to a Kafka topic

My source S3 folder can hold 48 files or more, so I am using:

.option("maxFilesPerTrigger", 1)

My EMR cluster is 1 master node + 2 slave nodes (each node of type m5.2xlarge),

each with 8 cores and 32 GB of memory.

For my Spark job, I would like to know what these options should be:

spark-submit \
--master yarn \
--conf spark.dynamicAllocation.enabled=false \
--executor-memory ??g \
--driver-memory ??g \
--executor-cores ?? \
--num-executors ?? \
--queue default \
--deploy-mode cluster \
....

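I would like to spread the load evenly. I have been experimenting, and the transactions per second I see on the HTTP endpoint go up and down, which I think is a direct result of my parameters. I also don't want to hog the whole cluster. Any ideas?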

The chart shows transactions per minute on the HTTP endpoint being called.

Answer 1

It depends on your time requirements, other jobs... To start with, you should probably try using the full cluster.

1 master + 2 slaves = 3 nodes
Cores = 3 * 8 = 24
Memory = 3 * 32 = 96 GB

Recommended number of cores per executor: 5. We reduce it to 4 so that no cores are left over (24 divides evenly by 4).
--executor-cores 4

Number of executors = 24 / 4 = 6 (1 for the driver/master and 5 executors)
--num-executors 5

executor-memory / driver-memory: (96 / 6) - ~10% ≈ 14g
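
The sizing arithmetic in this answer can be sketched in Python. This is only an illustration of the rule of thumb, not a definitive formula: the helper name `size_executors` and its parameters are hypothetical, the ~10% headroom is the figure used in the answer, and the sketch follows the answer in counting all three nodes toward the totals.

```python
import math

def size_executors(nodes, cores_per_node, mem_gb_per_node,
                   cores_per_executor=4, overhead_fraction=0.10):
    """Rule-of-thumb sizing that mirrors the steps in the answer (hypothetical helper)."""
    total_cores = nodes * cores_per_node
    total_mem_gb = nodes * mem_gb_per_node
    # Split the cores into equally sized containers; one container is kept for the driver.
    containers = total_cores // cores_per_executor
    num_executors = containers - 1
    # Equal memory share per container, minus ~10% headroom for overhead (assumed figure).
    share_gb = total_mem_gb / containers
    memory_gb = math.floor(share_gb * (1 - overhead_fraction))
    return {
        "executor_cores": cores_per_executor,
        "num_executors": num_executors,
        "executor_memory_gb": memory_gb,
        "driver_memory_gb": memory_gb,
    }

# The cluster from the question: 3 nodes, 8 cores and 32 GB each.
print(size_executors(3, 8, 32))
```

Changing `cores_per_executor` or `overhead_fraction` shows how the other values move, which may help when you do not want to occupy the whole cluster.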

Final parameters:

spark-submit \
--master yarn \
--conf spark.dynamicAllocation.enabled=false \
--executor-memory 14g \
--driver-memory 14g \
--executor-cores 4 \
--num-executors 5 \
--queue default \
--deploy-mode cluster \
....

You can easily take a few GB away from the driver and give them to the executors.
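
As a rough way to reason about how these parameters might relate to the uneven request rate, the sketch below counts how many tasks can run at once and how full the last "wave" of tasks is. It assumes Spark's default of one core per task (`spark.task.cpus=1`); the helper name `task_waves` and the partition count of 48 are hypothetical, since the post does not say how many partitions each micro-batch has. A mostly empty last wave may be one contributing factor to a fluctuating call rate, but this is not confirmed by the post.

```python
import math

def task_waves(num_partitions, num_executors, executor_cores, task_cpus=1):
    """Return (slots, waves, tasks_in_last_wave) for one stage (hypothetical helper)."""
    # Each executor can run executor_cores // task_cpus tasks at the same time.
    slots = num_executors * (executor_cores // task_cpus)
    # Tasks are scheduled as slots free up, roughly in waves of `slots` tasks.
    waves = math.ceil(num_partitions / slots)
    tasks_in_last_wave = num_partitions - (waves - 1) * slots
    return slots, waves, tasks_in_last_wave

# 5 executors x 4 cores as in the final parameters; 48 partitions is an assumed value.
slots, waves, last = task_waves(48, 5, 4)
print(f"{slots} concurrent tasks, {waves} waves, {last} tasks in the last wave")
```

If the last wave is much smaller than the number of slots, repartitioning to a multiple of the slot count is one option to try.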

apache-spark

spark-streaming