Autoscaling metrics on GCP Dataproc on YARN

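Why does a GCP Dataproc cluster, which uses YARN as its resource manager, autoscale based on memory requests and NOT on cores? Is this a limitation of Dataproc or of YARN, or am I missing something?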

Reference: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling

Autoscaling configures Hadoop YARN to schedule jobs based on YARN memory requests, not on YARN core requests.

Autoscaling is centered around the following Hadoop YARN metrics:

Allocated memory refers to the total YARN memory taken up by running containers across the whole cluster. If there are 6 running containers that can use up to 1GB, there is 6GB of allocated memory.

Available memory is YARN memory in the cluster not used by allocated containers. If there is 10GB of memory across all node managers and 6GB of allocated memory, there is 4GB of available memory. If there is available (unused) memory in the cluster, autoscaling may remove workers from the cluster.

Pending memory is the sum of YARN memory requests for pending containers. Pending containers are waiting for space to run in YARN. Pending memory is non-zero only if available memory is zero or too small to allocate to the next container. If there are pending containers, autoscaling may add workers to the cluster.
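To make the three metrics concrete, here is a minimal Python sketch of the arithmetic described in the quoted documentation. The `Container` class and the `yarn_memory_metrics` function are hypothetical names made up for illustration; they are not part of any Dataproc or YARN API, and the second scenario uses invented numbers.

```python
from dataclasses import dataclass


@dataclass
class Container:
    """Hypothetical stand-in for a YARN container request."""
    memory_gb: float
    running: bool  # True = allocated and running, False = pending


def yarn_memory_metrics(total_gb, containers):
    """Compute allocated, available, and pending memory as defined above."""
    allocated = sum(c.memory_gb for c in containers if c.running)
    pending = sum(c.memory_gb for c in containers if not c.running)
    available = total_gb - allocated
    return {"allocated": allocated, "available": available, "pending": pending}


# Documentation example: 10 GB across all node managers,
# six running containers of 1 GB each.
docs_example = yarn_memory_metrics(10, [Container(1, True)] * 6)

# Invented scenario: nine 1 GB containers running, one 2 GB container waiting.
# Only 1 GB is available, which is too small for the 2 GB request,
# so pending memory is non-zero.
busy_cluster = yarn_memory_metrics(
    10, [Container(1, True)] * 9 + [Container(2, False)]
)
```

In the first case there is unused memory and nothing pending, so autoscaling may remove workers; in the second there is pending memory, so autoscaling may add workers.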

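This is currently a limitation of Dataproc. By default, YARN finds slots for containers based on memory requests and ignores core requests entirely. So in the default configuration, Dataproc only needs to autoscale based on YARN pending/available memory.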

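There are definitely use cases where you want to oversubscribe YARN cores by running more containers. For example, our default distcp configuration might have 8 low-memory containers running on a node manager even if you only have 4 physical cores. Each distcp task is largely I/O bound and doesn't take up much memory. So I think keeping the default of scheduling based only on memory is reasonable.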
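As a rough illustration of the oversubscription point, the sketch below counts how many identical containers fit on one node when only memory is considered versus when cores are considered too. It is a simplification, not YARN's actual scheduling code: `max_containers` is a hypothetical helper, and the node size (16384 MB, 4 vcores) and request size (2048 MB, 1 vcore) are assumed values chosen to reproduce the 8-containers-on-4-cores example.

```python
def max_containers(node_mem_mb, node_vcores, req_mem_mb, req_vcores, use_cores):
    """Count identical containers that fit on one node (simplified model)."""
    by_memory = node_mem_mb // req_mem_mb
    if not use_cores:
        # Memory-only scheduling: core requests are ignored.
        return by_memory
    # Scheduling on both resources: the scarcer one sets the limit.
    by_cores = node_vcores // req_vcores
    return min(by_memory, by_cores)


# Assumed node: 16384 MB of YARN memory, 4 vcores.
# Assumed request: 2048 MB and 1 vcore per container.
memory_only = max_containers(16384, 4, 2048, 1, use_cores=False)
memory_and_cores = max_containers(16384, 4, 2048, 1, use_cores=True)
```

Under these assumptions, memory-only scheduling places 8 containers on the node, while counting cores caps it at 4, which is why I/O-bound workloads can benefit from the memory-only default.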

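If you're interested in configuring autoscaling based on YARN cores as well, I suspect you've already turned on YARN's DominantResourceCalculator to make YARN schedule on both memory and cores. Supporting DominantResourceCalculator is on our roadmap, but we've been prioritizing autoscaling stability fixes. Feel free to reach out privately to dataproc-feedback@google.com to tell us more about your use case.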