Spark coalesce relationship with number of executors and cores

This may be a very basic question about Spark, but I want to clear up my confusion. I am new to Spark and still trying to understand how it works internally.

Say I have a list of input files (assume 1000) that I want to process or write somewhere, and I want to use coalesce to reduce the number of partitions to 100.

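Now I run this job with 12 executors and 5 cores per executor, which means up to 60 tasks run at a time. Does that mean each task will work on a single partition independently?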

Round 1: 12 executors, each with 5 cores => 60 tasks process 60 partitions
Round 2: 8 executors, each with 5 cores => 40 tasks process the remaining 40 partitions, and 4 executors never pick up a task a second time

Or will all the tasks from the same executor work on the same partition?

Round 1: 12 executors => process 12 partitions
Round 2: 12 executors => process 12 partitions
Round 3: 12 executors => process 12 partitions
...
Round 9 (96 partitions already processed): 4 executors => process the remaining 4 partitions

Say I have a list of input files (assume 1000) that I want to process or write somewhere, and I want to use coalesce to reduce the number of partitions to 100.

By default in Spark, the number of partitions equals the number of HDFS blocks. Because coalesce(100) is specified, Spark will reduce the input data to 100 partitions.
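To make the effect of coalesce(100) concrete, here is a simplified model in plain Python, not Spark code. It assumes one input partition per file (1000 in total) and groups parent partitions into contiguous runs. Spark's real coalescer also takes data locality into account, so the contiguous grouping and the function name coalesce_groups are illustrative assumptions only.

```python
def coalesce_groups(num_parent_partitions: int, num_target_partitions: int):
    """Assign parent partition indices to target partitions in contiguous runs.

    Hypothetical helper: a simplified model of a coalesce without shuffle.
    """
    if num_target_partitions >= num_parent_partitions:
        # Without a shuffle, coalesce cannot increase the partition count.
        return [[i] for i in range(num_parent_partitions)]
    groups = []
    for t in range(num_target_partitions):
        start = t * num_parent_partitions // num_target_partitions
        end = (t + 1) * num_parent_partitions // num_target_partitions
        groups.append(list(range(start, end)))
    return groups


# Assumption: 1000 input files give 1000 parent partitions.
groups = coalesce_groups(1000, 100)
print(len(groups))     # 100
print(len(groups[0]))  # 10
```

In this model each of the 100 resulting partitions is read by one task, which pulls in the data of about 10 input partitions.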

Now I run this job with 12 executors and 5 cores per executor, which means up to 60 tasks run at a time. Does that mean each task will work on a single partition independently?

Presumably you passed options like the following:

--num-executors 12: the number of executors to launch for the application.

--executor-cores 5: the number of cores per executor. 1 core = 1 task at a time.
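The arithmetic behind these two options can be sketched in a few lines of plain Python. The helper names task_slots and waves are made up for illustration. The task_cpus parameter stands in for Spark's spark.task.cpus setting, which defaults to 1. The sketch assumes one task per partition and equally long tasks.

```python
import math


def task_slots(num_executors: int, executor_cores: int, task_cpus: int = 1) -> int:
    """Number of tasks that can run at the same time across all executors."""
    return num_executors * (executor_cores // task_cpus)


def waves(num_partitions: int, slots: int) -> int:
    """Minimum number of scheduling waves if every task took the same time."""
    return math.ceil(num_partitions / slots)


slots = task_slots(num_executors=12, executor_cores=5)
print(slots)              # 60
print(waves(100, slots))  # 2
```

Under these assumptions, 100 partitions need two waves on 60 slots: 60 tasks first, then the remaining 40.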

So the partitions are processed as follows.

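Round 1

60 partitions are processed by the 12 executors, each running 5 tasks (threads) at a time, with one task per partition.

Round 2

The remaining 40 partitions are processed as cores become free. Tasks from the same executor do not share a partition, and the rounds are not fixed: any executor with an idle core can take the next pending partition.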

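Note: In practice, some executors finish their assigned work sooner than others, depending on factors such as data locality, network I/O, and CPU. Such an executor then picks up the next partition to process, after waiting for the configured scheduling delay.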
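The behaviour described in the note can be illustrated with a small simulation in plain Python. It is a toy model, not Spark's scheduler: it ignores data locality and the scheduling delay, assumes one task per partition, and hands the next pending task to whichever core becomes free first. The function name simulate is hypothetical.

```python
import heapq


def simulate(task_durations, num_executors, executor_cores):
    """Greedy scheduling: each free core takes the next pending task.

    Returns (makespan, tasks_per_executor).
    """
    # One heap entry per core: (time the core becomes free, executor id, core id).
    cores = [(0.0, e, c) for e in range(num_executors) for c in range(executor_cores)]
    heapq.heapify(cores)
    tasks_per_executor = [0] * num_executors
    makespan = 0.0
    for duration in task_durations:
        free_at, executor, core = heapq.heappop(cores)
        finish = free_at + duration
        tasks_per_executor[executor] += 1
        makespan = max(makespan, finish)
        heapq.heappush(cores, (finish, executor, core))
    return makespan, tasks_per_executor


# 100 partitions of equal cost on 12 executors x 5 cores.
makespan, per_executor = simulate([1.0] * 100, num_executors=12, executor_cores=5)
print(makespan)           # 2.0
print(sum(per_executor))  # 100
```

With equal task durations, the second wave lands on whichever cores the tie-breaking picks first. With uneven durations, the cores that finish early pick up more partitions.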

hadoop

hadoop-yarn

apache-spark