Understand Spark: Cluster Manager, Master and Driver nodes

Question

Having read this, I would like to ask a few follow-up questions:

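1. The Cluster Manager is a long-running service; on which node does it run?
2. Is it possible for the Master and the Driver nodes to be the same machine? I presume there is a rule somewhere stating that these two nodes should be different?
3. In the case where the Driver node fails, who is responsible for relaunching the application, and what exactly happens? That is, how do the Master node, Cluster Manager and Worker nodes get involved (if they do), and in which order?
4. Similarly to the previous question: in the case where the Master node fails, what exactly happens, and who is responsible for recovering from the failure?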

Answer 1

1. The Cluster Manager is a long-running service, on which node it is running?

The cluster manager is the Master process in Spark standalone mode. It can be started anywhere by running ./sbin/start-master.sh; in YARN it would be the ResourceManager.

2. Is it possible that the Master and the Driver nodes will be the same machine? I presume that there should be a rule somewhere stating that these two nodes should be different?

The Master is per cluster, and the Driver is per application. For standalone and YARN clusters, Spark currently supports two deploy modes.

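In client mode, the driver is launched in the same process as the client that submits the application.
In cluster mode, however, the driver is launched on one of the Workers (standalone) or inside an application master process (YARN), and the client process exits as soon as it has fulfilled its responsibility of submitting the application, without waiting for the application to finish.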

If an application is submitted with --deploy-mode client from the Master node, the Master and the Driver will be on the same node. See "deployment of Spark application over YARN".
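
To make the two deploy modes concrete, here is a minimal sketch of how the corresponding spark-submit invocations might look. The master URL, class name, and jar name are hypothetical placeholders, and the helper `submit_cmd` is not part of Spark: it only prints the command so the sketch can be inspected without a cluster.

```shell
#!/bin/sh
# Hypothetical values: replace with your own master URL, main class, and jar.
MASTER_URL="spark://master-host:7077"
APP_CLASS="com.example.MyApp"
APP_JAR="my-app.jar"

# Print (rather than run) the spark-submit command for a given deploy mode.
submit_cmd() {
  deploy_mode="$1"   # "client" or "cluster"
  echo "./bin/spark-submit --master $MASTER_URL --deploy-mode $deploy_mode --class $APP_CLASS $APP_JAR"
}

submit_cmd client    # driver runs inside the spark-submit process on this machine
submit_cmd cluster   # standalone: driver is launched on one of the Workers
```

Running the client-mode command on the Master host is the case where the Master and the Driver share a machine.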

3. In the case where the Driver node fails, who is responsible for re-launching the application? And what will happen exactly? i.e. how the Master node, Cluster Manager and Workers nodes will get involved (if they do), and in which order?

If the driver fails, all executor tasks belonging to that submitted Spark application will be killed.

4. In the case where the Master node fails, what will happen exactly and who is responsible for recovering from the failure?

Master node failures are handled in one of two ways.

Standby Masters with ZooKeeper:

Utilizing ZooKeeper to provide leader election and some state storage, you can launch multiple Masters in your cluster connected to the same ZooKeeper instance. One will be elected “leader” and the others will remain in standby mode. If the current leader dies, another Master will be elected, recover the old Master’s state, and then resume scheduling. The entire recovery process (from the time the first leader goes down) should take between 1 and 2 minutes. Note that this delay only affects scheduling new applications – applications that were already running during Master failover are unaffected. Check here for configurations.
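
As a rough illustration of the ZooKeeper setup quoted above, the fragment below sketches what conf/spark-env.sh might contain on each Master host. The property names are the standalone recovery settings from the Spark documentation; the ZooKeeper host names and the /spark directory are assumptions made for the example.

```shell
# conf/spark-env.sh on every Master host.
# Hypothetical ZooKeeper ensemble; replace with your own hosts and ports.
ZK_URL="zk1:2181,zk2:2181,zk3:2181"

# Enable ZooKeeper-based recovery for the standalone Master daemons.
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
-Dspark.deploy.zookeeper.url=$ZK_URL \
-Dspark.deploy.zookeeper.dir=/spark"
export SPARK_DAEMON_JAVA_OPTS
```

Every Master started with these options joins the same election, so starting ./sbin/start-master.sh on two or more hosts gives one leader and the rest as standbys.
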

Single-node recovery with the local file system:

ZooKeeper is the best way to go for production-level high availability, but if you want to be able to restart the Master if it goes down, FILESYSTEM mode can take care of it. When applications and Workers register, they have enough state written to the provided directory so that they can be recovered upon a restart of the Master process. Check here for configuration and more details.
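
For comparison, a FILESYSTEM-mode configuration might look like the sketch below, again placed in conf/spark-env.sh on the Master host. The property names come from the Spark standalone documentation; the recovery directory path is a hypothetical choice and must be readable and writable by the Master process.

```shell
# conf/spark-env.sh on the single Master host.
# Hypothetical directory where the Master persists recovery state.
RECOVERY_DIR="/var/spark/recovery"

# Enable file-system-based recovery for the standalone Master daemon.
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM \
-Dspark.deploy.recoveryDirectory=$RECOVERY_DIR"
export SPARK_DAEMON_JAVA_OPTS
```

This mode does not restart the Master by itself; something external (an init system or process monitor, for instance) still has to bring the process back up so it can read the saved state.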

Answer 2

The Cluster Manager is a long-running service; on which node is it running?

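A cluster manager is just a manager of resources, i.e. CPUs and RAM, that SchedulerBackends use to launch tasks. It does nothing more for Apache Spark than offer resources; once the Spark executors have launched, they communicate directly with the driver to run tasks.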

You can start a standalone master server by executing:

./sbin/start-master.sh

It can be started anywhere.

To run an application on the Spark cluster:

./bin/spark-shell --master spark://IP:PORT

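Is it possible for the Master and the Driver nodes to be the same machine? I presume there is a rule somewhere stating that these two nodes should be different?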

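In standalone mode, certain JVMs start when you bring up your machines: the Spark Master starts, and a Worker JVM starts on each machine and registers with the Spark Master. Together they act as the resource manager. When you start or submit your application, a Driver starts up; in client mode it runs on whichever machine you ssh into to launch the application, while in cluster mode the Master places it on one of the Workers. The Driver JVM contacts the Spark Master to request executors (Ex), and in standalone mode the Workers start the Ex. So the Spark Master is per cluster, and the Driver JVM is per application.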

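In the case where the Driver node fails, who is responsible for relaunching the application, and what exactly happens? That is, how do the Master node, Cluster Manager and Worker nodes get involved (if they do), and in which order?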

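If an Ex JVM crashes, the Worker JVM will start a new Ex, and when a Worker JVM crashes, the Spark Master will start the lost Ex again on the remaining Workers. On a Spark standalone cluster with cluster deploy mode, you can also specify --supervise to make sure that the driver is automatically restarted if it fails with a non-zero exit code. In that case the Spark Master will start the Driver JVM.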
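
A supervised submission could be sketched as follows. The --supervise flag and the org.apache.spark.deploy.Client kill command are described in the Spark standalone documentation; the master URL, class, jar, and driver ID are hypothetical, and the two helper functions only print the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical master URL; replace with your own.
MASTER_URL="spark://master-host:7077"

# Print a cluster-mode submission whose driver is restarted by the Master
# if it exits with a non-zero code. $1 = main class, $2 = application jar.
supervised_submit_cmd() {
  echo "./bin/spark-submit --master $MASTER_URL --deploy-mode cluster --supervise --class $1 $2"
}

# Print the command that stops a supervised driver for good.
# $1 = driver ID as shown in the Master web UI.
kill_driver_cmd() {
  echo "./bin/spark-class org.apache.spark.deploy.Client kill $MASTER_URL $1"
}

supervised_submit_cmd com.example.MyApp my-app.jar
kill_driver_cmd driver-20240101000000-0000
```

The kill command matters because a supervised driver that keeps failing would otherwise be relaunched again and again.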

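Similarly to the previous question: in the case where the Master node fails, what exactly happens, and who is responsible for recovering from the failure?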

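A Master failure leaves the executors and the driver unable to communicate with it, so the driver can no longer ask it for job status, and an application that still depends on the Master (for example, to acquire new executors) can stall or fail. Applications that are already running will notice the loss of the Master, but otherwise should continue to work more or less as if nothing had happened, with two important exceptions: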

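1. The application will not be able to finish gracefully.

2. If the Spark Master is down, each Worker will try to reregisterWithMaster. If this fails multiple times, the Worker will simply give up.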

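reregisterWithMaster() -- Re-register with the active master this worker has been communicating with. If there is none, then this worker is still bootstrapping and has not yet established a connection with a master, in which case it should re-register with all masters. It is important to re-register only with the active master during failures: if the worker unconditionally attempted to re-register with all masters, a race condition could arise. The error is detailed in SPARK-4592.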

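At this point a long-running application will not be able to continue processing, but this still should not result in immediate failure. Instead, the application will wait for the master to come back online (file system recovery) or for contact from the new leader (ZooKeeper mode), and if that happens it will continue processing.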
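
For an application to find a new leader in ZooKeeper mode, the Spark documentation describes passing a comma-separated list of Masters in the master URL. The sketch below builds such a URL; the helper `master_url`, the host names, and the assumption that every Master listens on port 7077 are illustrative only.

```shell
#!/bin/sh
# Build a spark:// URL that lists every Master host given as an argument.
# Assumes each Master listens on port 7077.
master_url() {
  url=""
  for host in "$@"; do
    if [ -z "$url" ]; then
      url="spark://$host:7077"
    else
      url="$url,$host:7077"
    fi
  done
  echo "$url"
}

# Print (rather than run) a spark-shell command that knows about both Masters.
echo "./bin/spark-shell --master $(master_url master1 master2)"
```

With a single-Master URL the application has no way to locate a newly elected leader, so listing all Masters is worth the small extra effort in a high-availability setup.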

Tags: failover, hadoop, hadoop-yarn, apache-spark, apache-spark-standalone