Spark: Why is execution carried out by the master node and not the worker nodes?

I have a Spark cluster consisting of one master node and two worker nodes.

When I run the following code to pull data from a database, it actually executes on the master rather than on one of the workers.

    sparkSession.read
      .format("jdbc")
      .option("url", jdbcURL)
      .option("user", user)
      .option("query", query)
      .option("driver", driverClass)
      .option("fetchsize", fetchsize)
      .option("numPartitions", numPartitions)
      .option("queryTimeout", queryTimeout)
      .options(options)
      .load()

Is this the expected behavior?

Is there any way to disable this behavior?

A Spark application has two types of runners: the driver and the executors, and two types of operations: transformations and actions. According to this doc:

RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).

...

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
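The distinction above can be sketched in a few lines of Scala; the session setup and input path here are hypothetical, and this would only run inside an actual Spark deployment:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lazy-demo").getOrCreate()

// Transformation: map is only recorded in the RDD lineage here;
// no work is dispatched to the executors yet.
val lengths = spark.sparkContext
  .textFile("hdfs:///tmp/input.txt") // hypothetical input path
  .map(_.length)

// Action: reduce triggers the actual computation on the executors
// and returns the final value to the driver program.
val total = lengths.reduce(_ + _)
```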

So in a Spark application, some operations are executed on the executors and some on the driver. On Dataproc, executors always run in YARN containers on worker nodes, but the driver can run on either the master node or a worker node. The default, called "client mode", runs the driver on the master node outside of YARN. You can instead enable "cluster mode" with `gcloud dataproc jobs submit spark ... --properties spark.submit.deployMode=cluster`, which runs the driver in a YARN container on a worker node. See this doc for details.
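A full submit command in cluster mode might look like the following; the cluster name, region, main class, and jar path are all placeholders:

```shell
# Submit a Spark job in cluster mode so the driver runs in a YARN
# container on a worker node instead of on the master node.
# Cluster name, region, main class, and jar path are placeholders.
gcloud dataproc jobs submit spark \
  --cluster=my-cluster \
  --region=us-central1 \
  --class=com.example.MyApp \
  --jars=gs://my-bucket/app.jar \
  --properties=spark.submit.deployMode=cluster
```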