How do distributed computations on RDDs work in Apache Spark?

Question

If I have a very simple Spark program that just does:

val rdd = sc.textFile("hdfs:///text.txt")
println(rdd.count)

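When I submit this Spark program to YARN in yarn-cluster mode:

The YARN ResourceManager will negotiate a container and launch the Spark ApplicationMaster.
The ApplicationMaster will then register itself with the ResourceManager and request resources.
Once it obtains the resource specifications from the ResourceManager, the ApplicationMaster will launch containers on the NodeManagers.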

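My question is this: since data in Hadoop is distributed across multiple machines (suppose text.txt in the example above is split into 3 blocks):

Does an ApplicationMaster get launched on each and every machine that has a text.txt block?
Is the Spark executor software that is already installed on each and every node of the cluster, or does the executor get instantiated in the container that the ApplicationMaster launches on the node?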

Answer 1

Good question, though probably too broad for this forum.

First, your assumptions are largely correct, but the timing is off.

The YARN ResourceManager will negotiate a container and launch the Spark ApplicationMaster.

The ApplicationMaster will then register itself with the ResourceManager and request resources.

Once it obtains the resource specifications from the ResourceManager, the ApplicationMaster will launch containers on the NodeManagers.

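If you are using sc (the SparkContext), then all of this has already happened. The ResourceManager may have additional work to do if you add or remove executors, but the SparkContext only exists after the initial resources have been allocated.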
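To make the timing concrete, here is a minimal sketch of the question's two lines as a complete application. The object name LineCount is hypothetical, and the sketch assumes the Spark core library is on the classpath and that the master and deploy mode are supplied at submission time rather than hard-coded.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical main class; any class handed to spark-submit would do.
object LineCount {
  def main(args: Array[String]): Unit = {
    // Master and deploy mode are assumed to come from the submit command.
    val conf = new SparkConf().setAppName("LineCount")

    // In yarn-cluster mode this main method runs inside the ApplicationMaster
    // container, so the ApplicationMaster already exists before this line runs.
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.textFile("hdfs:///text.txt") // lazy: nothing is read yet
      println(rdd.count())                       // action: tasks run on the executors
    } finally {
      sc.stop() // release the executors and let the YARN application finish
    }
  }
}
```

In yarn-cluster mode the println output ends up in the driver's log on the cluster, not on the terminal that submitted the job.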

Does an ApplicationMaster get launched on each and every machine that has a text.txt block?

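No. Workers (executors) can be launched on any machine that holds a block, or they may all be launched on a single machine. Either way, each block is (probably) read by some worker, and in this HDFS case a worker can read a block from anywhere on the cluster.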
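One way to see this on a real cluster is to compare where each partition would prefer to run with where it actually ran. The sketch below is only an illustration: the object and method names are hypothetical, and it reads the file through sc.hadoopFile (which textFile is built on) because the RDD returned there reports the HDFS block hosts directly.

```scala
import java.net.InetAddress
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.SparkContext

object BlockLocality {
  // Hypothetical helper: prints preferred hosts and actual hosts per partition.
  def report(sc: SparkContext, path: String): Unit = {
    val raw = sc.hadoopFile[LongWritable, Text, TextInputFormat](path)

    // Hosts that hold the underlying HDFS block for each partition.
    raw.partitions.foreach { p =>
      val hosts = raw.preferredLocations(p).mkString(", ")
      println(s"partition ${p.index} prefers: $hosts")
    }

    // Run one task per partition and record the host that processed it.
    val ran = raw.mapPartitionsWithIndex { (idx, records) =>
      Iterator((idx, InetAddress.getLocalHost.getHostName, records.size))
    }.collect()

    ran.foreach { case (idx, host, n) =>
      println(s"partition $idx ran on $host ($n lines)")
    }
  }
}
```

Calling BlockLocality.report(sc, "hdfs:///text.txt") from the driver would print both lists. The two need not match: the scheduler treats block locations as a preference, and an executor on another host can still read the block over the network.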

Is the Spark executor software that is already installed on each and every node of the cluster, or does the executor get instantiated in the container that the ApplicationMaster launches on the node?

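It can be installed on the nodes, or shipped to them at runtime. The Spark executor is just a jar. I have seen it placed in HDFS itself, and I have seen it placed as a local resource on every machine in the cluster. You can put it anywhere the workers can access.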
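As a rough illustration of the "jar in HDFS" option, the sketch below submits an application through Spark's launcher API and points YARN at Spark jars stored in HDFS. It assumes Spark 2.x or later (where the setting is named spark.yarn.jars), a Spark installation on the submitting machine, and the spark-launcher library on the classpath. The jar path, main class, and HDFS directory are all hypothetical placeholders.

```scala
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object SubmitLineCount {
  def main(args: Array[String]): Unit = {
    val handle: SparkAppHandle = new SparkLauncher()
      .setAppResource("/path/to/line-count.jar") // hypothetical application jar
      .setMainClass("LineCount")                 // hypothetical main class
      .setMaster("yarn")
      .setDeployMode("cluster")                  // the "yarn-cluster" mode from the question
      // Assumption: the Spark jars were uploaded to this HDFS directory beforehand.
      .setConf("spark.yarn.jars", "hdfs:///spark/jars/*.jar")
      .startApplication()

    // Poll until YARN reports a final state (finished, failed, or killed).
    while (!handle.getState.isFinal) {
      Thread.sleep(1000)
    }
    println(s"final state: ${handle.getState}")
  }
}
```

If the setting is left out, Spark on YARN falls back to the jars of the local Spark installation and uploads them for the containers, which corresponds to the "shipped at runtime" case.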

hadoop

hadoop-yarn

apache-spark