How does Spark prepare executors on Hadoop YARN?

Question

I'm trying to understand the details of how Spark prepares executors. To do this, I debugged org.apache.spark.executor.CoarseGrainedExecutorBackend and invoked

Thread.currentThread().getContextClassLoader.getResource("")

which points to the following directory:

/hadoop/yarn/local/usercache/_MY_USER_NAME_/appcache/application_1507907717252_15771/container_1507907717252_15771_01_000002/

Looking in that directory, I found the following files:

default_container_executor_session.sh
default_container_executor.sh
launch_container.sh
__spark_conf__
__spark_libs__

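The question is: who delivers the files to each executor and then just runs CoarseGrainedExecutorBackend with the appropriate classpath? What are the scripts? Are they all YARN-autogenerated?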

I looked at org.apache.spark.deploy.SparkSubmit, but didn't find anything useful there.

Answer 1

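Ouch... you're asking for quite a lot of detail on how Spark communicates with cluster managers when requesting resources. Let me give you some information. Keep asking if you want more...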

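You are using Hadoop YARN as the cluster manager for Spark applications. Let's focus on this particular cluster manager only (there are others that Spark supports, such as Apache Mesos, Spark Standalone, DC/OS and soon Kubernetes, and each has its own way of handling Spark deployments).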

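By default, when you submit a Spark application using spark-submit, the Spark application (i.e. the SparkContext it actually uses) requests three YARN containers. One container is for that Spark application's ApplicationMaster, which knows how to talk to YARN and requests two more YARN containers for two Spark executors.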
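
The arithmetic behind "three containers" can be written down as a tiny sketch. It assumes static allocation, where the executor count comes from spark.executor.instances (or --num-executors) and defaults to 2; the object and method names below are made up for illustration and are not part of Spark.

```scala
// Hypothetical helper, not a Spark API: counts the YARN containers a
// Spark application asks for under static allocation.
object YarnContainers {
  // One container for the ApplicationMaster plus one per executor.
  // The default of 2 mirrors the default of spark.executor.instances.
  def requested(executorInstances: Int = 2): Int = 1 + executorInstances
}
```

With dynamic allocation enabled the executor count changes over the lifetime of the application, so the number above is only the starting point in that case.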

You can review the official YARN documentation, Apache Hadoop YARN and Hadoop: Writing YARN Applications, to get deeper into the YARN internals.

When a Spark application is submitted, Spark's ApplicationMaster is submitted to YARN using the YARN "protocol", which requires that the request for the very first YARN container (container 0) use a ContainerLaunchContext that holds all the necessary launch details (see Client.createContainerLaunchContext).

who delivers the files to each executor

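That's how YARN is told how to launch the ApplicationMaster for a Spark application. While fulfilling the request for the ApplicationMaster container, YARN downloads the necessary files, which are the ones you found in the container's working space.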

That's pretty much how any YARN application works on YARN and has (almost) nothing to do with Spark.
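
To make the idea of a launch context more concrete, here is a simplified stand-in model in plain Scala. It is not the Hadoop API: LaunchContext, LocalFile and LaunchScript are hypothetical names, and the rendered text only approximates what a generated launch_container.sh does (export the environment, link the localized files into the working directory, exec the command).

```scala
// Hypothetical, simplified stand-ins for YARN's ContainerLaunchContext
// and LocalResource. They only illustrate what a launch context carries.
final case class LocalFile(linkName: String, localizedPath: String)

final case class LaunchContext(
    localFiles: Seq[LocalFile],       // files YARN localizes before launch
    environment: Map[String, String], // e.g. CLASSPATH
    command: Seq[String])             // process to start in the container

object LaunchScript {
  private def quoted(s: String): String = "\"" + s + "\""

  // Renders a shell-like script that approximates a launch_container.sh.
  def render(ctx: LaunchContext): String = {
    val exports = ctx.environment.toSeq.sortBy(_._1).map {
      case (k, v) => s"export $k=${quoted(v)}"
    }
    val links = ctx.localFiles.map { f =>
      s"ln -sf ${quoted(f.localizedPath)} ${quoted(f.linkName)}"
    }
    val cmd = ctx.command.mkString(" ")
    val exec = s"exec /bin/bash -c ${quoted(cmd)}"
    (Seq("#!/bin/bash") ++ exports ++ links :+ exec).mkString("\n")
  }
}
```

In this model Spark's part is filling in the three fields, and YARN's part is everything that happens in render and afterwards.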

The code responsible for this communication is in Spark's Client, esp. Client.submitApplication.

and then just runs CoarseGrainedExecutorBackend with the appropriate classpath.

Quoting the Mastering Apache Spark 2 gitbook:

CoarseGrainedExecutorBackend is a standalone application that is started in a resource container when (...) Spark on YARN’s ExecutorRunnable is started.

ExecutorRunnable is started when Spark on YARN's YarnAllocator schedules it in allocated YARN resource containers.
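
As a rough sketch of what ends up being run in an executor container, the command can be thought of as a java invocation of CoarseGrainedExecutorBackend with a handful of arguments. The ExecutorCommand object below is hypothetical; the flag names follow the ones CoarseGrainedExecutorBackend parses in Spark 2.x, and the real command built by ExecutorRunnable carries more than this (JVM options, user classpath entries, log redirection), so check the source for your version.

```scala
// Hypothetical helper, not Spark's ExecutorRunnable: builds a simplified
// approximation of the command line that starts an executor JVM.
object ExecutorCommand {
  def build(
      driverUrl: String,
      executorId: String,
      hostname: String,
      cores: Int,
      appId: String,
      memoryMb: Int): Seq[String] =
    Seq(
      "{{JAVA_HOME}}/bin/java", // placeholder that YARN expands on the node
      "-server",
      s"-Xmx${memoryMb}m",
      "org.apache.spark.executor.CoarseGrainedExecutorBackend",
      "--driver-url", driverUrl,
      "--executor-id", executorId,
      "--hostname", hostname,
      "--cores", cores.toString,
      "--app-id", appId)
}
```

The classpath itself is not on this command line in the sketch; it is assumed to come from the CLASSPATH entry of the container's environment.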

What are the scripts? Are they all YARN-autogenerated?

Kind of.

Some are prepared by Spark as part of the Spark application submission, while others are YARN-specific.

Enable the DEBUG logging level in your Spark application and you will see the file transfer.

You can find more information in the official Spark documentation, Running Spark on YARN, and in my Mastering Apache Spark 2 gitbook.

hadoop-yarn

apache-spark