Dataproc cluster runs a maximum of 5 jobs in parallel, ignoring available resources

Question

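I am using a Spark job to load data from 1200 MS SQL Server tables into BigQuery. It is all part of an orchestrated ETL process in which the Spark job consists of Scala code that receives messages from PubSub. So 1200 messages are received over a period of about an hour. Each message triggers code (asynchronously) that reads data from a table, applies minor transformations, and writes to BigQuery. The process itself works fine. My problem is that the number of active jobs in Spark never goes above 5, even though plenty of "jobs" are waiting and plenty of resources are available.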

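I tried raising spark.driver.cores to 30, but nothing changed. Also, although this setting is visible in the Google console, it does not seem to reach the actual Spark job (when viewed in the Spark UI). Here is the Spark job running in the console: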

And here are the Spark job properties:

It is a fairly large cluster with plenty of available resources:

Here is the command line used to create the cluster:

gcloud dataproc clusters create odsengine-cluster \
--properties dataproc:dataproc.conscrypt.provider.enable=false,spark:spark.executor.userClassPathFirst=true,spark:spark.driver.userClassPathFirst=true \
--project=xxx \
--region europe-north1 \
--zone europe-north1-a \
--subnet xxx \
--master-machine-type n1-standard-4 \
--worker-machine-type m1-ultramem-40 \
--master-boot-disk-size 30GB \
--worker-boot-disk-size 2000GB \
--image-version 1.4 \
--master-boot-disk-type=pd-ssd \
--worker-boot-disk-type=pd-ssd \
--num-workers=2 \
--scopes cloud-platform \
--initialization-actions gs://xxx/cluster_init/init_actions.sh

And the command line used to submit the Spark job:

gcloud dataproc jobs submit spark \
--project=velliv-dwh-development \
--cluster odsengine-cluster \
--region europe-north1 \
--jars gs://velliv-dwh-dev-bu-dcaods/OdsEngine_2.11-0.1.jar \
--class Main \
--properties \
spark.executor.memory=35g,\
spark.executor.cores=2,\
spark.executor.memoryOverhead=2g,\
spark.dynamicAllocation.enabled=true,\
spark.shuffle.service.enabled=true,\
spark.driver.cores=30 \
-- yarn

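I know I could look into using partitioning to spread the load of large individual tables, and I have done that successfully in another scenario, but in this case I just want to load many tables at once without partitioning each table.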

Answer 1

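Regarding "plenty of jobs waiting and plenty of resources available", I suggest checking the Spark logs and the YARN web UI and logs to see whether there are pending applications and why. It also helps to check YARN resource utilization in the monitoring tab of the cluster web UI.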
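The checks suggested above can also be made from a shell on the master node. What follows is only a sketch: it assumes the YARN ResourceManager REST API is reachable at localhost:8088 (override RM_URL if it is not), and the function names are hypothetical helpers, not part of Dataproc or YARN.

```shell
#!/bin/sh
# Sketch: inspect YARN from the Dataproc master node (e.g. over SSH).
# Assumption: the ResourceManager REST API is reachable at localhost:8088.
RM_URL="${RM_URL:-http://localhost:8088}"

# Applications YARN has accepted but not yet started (i.e. still waiting).
list_pending_apps() {
  yarn application -list -appStates ACCEPTED
}

# Applications that are currently running.
list_running_apps() {
  yarn application -list -appStates RUNNING
}

# Read the cluster metrics JSON on stdin and print one numeric field,
# e.g. appsPending, appsRunning, availableMB, availableVirtualCores.
metric() {
  sed -n "s/.*\"$1\":[[:space:]]*\([0-9][0-9]*\).*/\1/p"
}

# Print pending/running application counts and free resources.
cluster_summary() {
  json=$(curl -s "$RM_URL/ws/v1/cluster/metrics")
  for field in appsPending appsRunning availableMB availableVirtualCores; do
    echo "$field=$(echo "$json" | metric "$field")"
  done
}
```

Calling `cluster_summary` while the load is running shows whether YARN itself reports pending applications; if appsPending stays at zero while work is queued, the waiting is presumably happening before the work reaches YARN.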

As for spark.driver.cores, it takes effect only in cluster mode, as the Spark configuration documentation says:

Number of cores to use for the driver process, only in cluster mode

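In Dataproc, the Spark driver runs in client mode by default, which means the driver runs on the master node, outside of YARN. You can run the driver in cluster mode, as a YARN container, by setting the property spark.submit.deployMode=cluster.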
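As a rough sketch of that suggestion, the submit command from the question could carry the deploy mode as one more property. The project, cluster, region, jar, and class are copied from the question; the driver core count of 4 is an arbitrary illustrative value (an assumption, not a recommendation), and wrapping the command in a function named submit_cluster_mode is only a convenience for this example.

```shell
#!/bin/sh
# Sketch: submit the job from the question with the driver in cluster mode.
# All values come from the question except spark.driver.cores=4, which is
# an arbitrary illustrative number.

# Build the comma-separated list that --properties expects.
SPARK_PROPERTIES="spark.submit.deployMode=cluster"   # driver runs in a YARN container
SPARK_PROPERTIES="$SPARK_PROPERTIES,spark.driver.cores=4"   # honored only in cluster mode
SPARK_PROPERTIES="$SPARK_PROPERTIES,spark.executor.memory=35g"
SPARK_PROPERTIES="$SPARK_PROPERTIES,spark.executor.cores=2"
SPARK_PROPERTIES="$SPARK_PROPERTIES,spark.executor.memoryOverhead=2g"
SPARK_PROPERTIES="$SPARK_PROPERTIES,spark.dynamicAllocation.enabled=true"
SPARK_PROPERTIES="$SPARK_PROPERTIES,spark.shuffle.service.enabled=true"

# Defined as a function so that loading this file does not submit anything.
submit_cluster_mode() {
  gcloud dataproc jobs submit spark \
    --project=velliv-dwh-development \
    --cluster odsengine-cluster \
    --region europe-north1 \
    --jars gs://velliv-dwh-dev-bu-dcaods/OdsEngine_2.11-0.1.jar \
    --class Main \
    --properties "$SPARK_PROPERTIES" \
    -- yarn
}
```

Running `submit_cluster_mode` submits the job. In cluster mode the driver competes for YARN resources like any other container, so its cores and memory count against the cluster's capacity.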

apache-spark

google-cloud-dataproc