Autoscaling VertexAI pipeline components

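I am exploring Vertex AI Pipelines and understand that it is a managed alternative to, for example, AI Platform Pipelines (where you have to deploy a GKE cluster to be able to run Kubeflow pipelines). What I am not clear about is whether Vertex AI will autoscale the cluster depending on the load. In the answer to a similar question, it is mentioned that for pipeline steps that use GCP resources such as Dataflow etc., autoscaling will be done automatically. In the Google docs, it is mentioned that for components, one can set resources, e.g. CPU_LIMIT, GPU_LIMIT etc. My question is, can these limits be set for any type of component, i.e. Google Cloud Pipeline Components or custom components, whether Python function-based or packaged as a container image? Secondly, do these limits mean that the component resources will autoscale until they hit those limits? And what happens if these options are not specified at all: how are the resources allocated, and will they autoscale as Vertex AI sees fit?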

Links to relevant docs and resources would be really helpful.

To answer your questions:

1. Can these limits be set for any type of component?

Yes. These limits apply to all Kubeflow components and are not specific to any particular type of component, so any component can be implemented to perform its task with a set amount of resources.


2. Do these limits mean that the component resources will autoscale until they hit the limits?

No, Vertex AI does not perform any autoscaling. Based on the limits set, Vertex AI chooses one suitable VM to perform the task. A pool of workers is supported in Google Cloud Pipeline Components such as "CustomContainerTrainingJobRunOp" and "CustomPythonPackageTrainingJobRunOp" as part of Distributed Training in Vertex AI. Otherwise, only one machine is used per step.


3. What happens if these limits are not specified? Does Vertex AI scale the resources as it sees fit?

If the limits are not specified, an “e2-standard-4” VM is used for task execution as the default option.
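
To make answers 2 and 3 concrete, here is a rough sketch of the "one fixed VM per step" idea. It is not Vertex AI's actual selection logic, which is not described here. The `pick_machine` helper and the short machine table are hypothetical and purely illustrative: the sketch assumes the smallest listed machine that satisfies both limits is chosen, and it falls back to e2-standard-4 when no limits are given.

```python
from typing import Optional

# (name, vCPUs, memory in GB) for a few E2 standard machine types, smallest first.
# Illustrative subset only; not the full list Vertex AI can choose from.
MACHINE_TYPES = [
    ("e2-standard-4", 4, 16),
    ("e2-standard-8", 8, 32),
    ("e2-standard-16", 16, 64),
    ("e2-standard-32", 32, 128),
]

# Default used when a step declares no limits, as stated in the answer above.
DEFAULT_MACHINE = "e2-standard-4"


def pick_machine(cpu_limit: Optional[float] = None,
                 memory_limit_gb: Optional[float] = None) -> str:
    """Hypothetical helper: return the smallest listed machine meeting both limits."""
    if cpu_limit is None and memory_limit_gb is None:
        return DEFAULT_MACHINE
    cpu = cpu_limit or 0
    mem = memory_limit_gb or 0
    for name, vcpus, mem_gb in MACHINE_TYPES:
        if vcpus >= cpu and mem_gb >= mem:
            return name
    raise ValueError("no listed machine type satisfies the requested limits")


print(pick_machine())                                  # no limits set
print(pick_machine(cpu_limit=8, memory_limit_gb=16))   # CPU-bound request
print(pick_machine(cpu_limit=2, memory_limit_gb=60))   # memory-bound request
```

The point of the sketch is that the limits select a single machine size up front. Nothing in it grows or shrinks while the step runs, which matches the "no autoscaling" answer.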


Edit: I have updated the links to point to the latest version of the documentation.