ModelUploadOp step failing with custom prediction container

I am currently trying to deploy a Vertex pipeline to achieve the following:

  1. Train a custom model (from a custom training Python package) and dump the model artifacts (the trained model and the data preprocessor, which will be used at prediction time). This step works fine, as I can see new resources being created in the bucket.

  2. Create a model resource via ModelUploadOp. For some reason this step fails when specifying serving_container_environment_variables and serving_container_ports, with the error message reported in the errors section below. This is somewhat surprising, because both are required by the prediction container, and the environment variables are passed as a dictionary exactly as specified in the documentation.
    This step works just fine using the gcloud command:

gcloud ai models upload \
    --region us-west1 \
    --display-name session_model_latest \
    --container-image-uri gcr.io/and-reporting/pred:latest \
    --container-env-vars="MODEL_BUCKET=ml_session_model" \
    --container-health-route=/health \
    --container-predict-route=/predict \
    --container-ports=5000
  3. Create the endpoint.
  4. Deploy the model to the endpoint.

Clearly I am getting something wrong with Vertex, and the component documentation is not much help in this case.

Pipeline

from datetime import datetime

import kfp
from google.cloud import aiplatform
from google_cloud_pipeline_components import aiplatform as gcc_aip
from kfp.v2 import compiler

PIPELINE_ROOT = "gs://ml_model_bucket/pipeline_root"


@kfp.dsl.pipeline(name="session-train-deploy", pipeline_root=PIPELINE_ROOT)
def pipeline():
    training_op = gcc_aip.CustomPythonPackageTrainingJobRunOp(
        project="my-project",
        location="us-west1",
        display_name="train_session_model",
        model_display_name="session_model",
        service_account="name@my-project.iam.gserviceaccount.com",
        environment_variables={"MODEL_BUCKET": "ml_session_model"},
        python_module_name="trainer.train",
        staging_bucket="gs://ml_model_bucket/",
        base_output_dir="gs://ml_model_bucket/",
        args=[
            "--gcs-data-path",
            "gs://ml_model_data/2019-Oct_short.csv",
            "--gcs-model-path",
            "gs://ml_model_bucket/model/model.joblib",
            "--gcs-preproc-path",
            "gs://ml_model_bucket/model/preproc.pkl",
        ],
        container_uri="us-docker.pkg.dev/vertex-ai/training/scikit-learn-cpu.0-23:latest",
        python_package_gcs_uri="gs://ml_model_bucket/trainer-0.0.1.tar.gz",
        model_serving_container_image_uri="gcr.io/my-project/pred",
        model_serving_container_predict_route="/predict",
        model_serving_container_health_route="/health",
        model_serving_container_ports=[5000],
        model_serving_container_environment_variables={
            "MODEL_BUCKET": "ml_model_bucket/model"
        },
    )

    model_upload_op = gcc_aip.ModelUploadOp(
        project="my-project",
        location="us-west1",
        display_name="session_model",
        serving_container_image_uri="gcr.io/my-project/pred:latest",
        # When passing the following 2 arguments this step fails...
        serving_container_environment_variables={"MODEL_BUCKET": "ml_model_bucket/model"},
        serving_container_ports=[5000],
        serving_container_predict_route="/predict",
        serving_container_health_route="/health",
    )
    model_upload_op.after(training_op)

    endpoint_create_op = gcc_aip.EndpointCreateOp(
        project="my-project",
        location="us-west1",
        display_name="pipeline_endpoint",
    )

    model_deploy_op = gcc_aip.ModelDeployOp(
        model=model_upload_op.outputs["model"],
        endpoint=endpoint_create_op.outputs["endpoint"],
        deployed_model_display_name="session_model",
        traffic_split={"0": 100},
        service_account="name@my-project.iam.gserviceaccount.com",
    )
    model_deploy_op.after(endpoint_create_op)


if __name__ == "__main__":
    ts = datetime.now().strftime("%Y%m%d%H%M%S")
    compiler.Compiler().compile(pipeline, "custom_train_pipeline.json")
    pipeline_job = aiplatform.PipelineJob(
        display_name="session_train_and_deploy",
        template_path="custom_train_pipeline.json",
        job_id=f"session-custom-pipeline-{ts}",
        enable_caching=True,
    )
    pipeline_job.submit()

Errors and notes

  1. When specifying serving_container_environment_variables and serving_container_ports, the step fails with the following error:
{'code': 400,
 'message': 'Invalid JSON payload received. Unknown name "MODEL_BUCKET" at \'model.container_spec.env[0]\': Cannot find field.\nInvalid value at \'model.container_spec.ports[0]\' (type.googleapis.com/google.cloud.aiplatform.v1.Port), 5000',
 'status': 'INVALID_ARGUMENT',
 'details': [{'@type': 'type.googleapis.com/google.rpc.BadRequest',
              'fieldViolations': [{'field': 'model.container_spec.env[0]',
                                   'description': 'Invalid JSON payload received. Unknown name "MODEL_BUCKET" at \'model.container_spec.env[0]\': Cannot find field.'},
                                  {'field': 'model.container_spec.ports[0]',
                                   'description': "Invalid value at 'model.container_spec.ports[0]' (type.googleapis.com/google.cloud.aiplatform.v1.Port), 5000"}]}]}

  2. When commenting out serving_container_environment_variables and serving_container_ports, the model resource is created, but deploying it to the endpoint manually fails with no output logs.
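The error payload hints at a shape mismatch: model.container_spec.env expects a list of Kubernetes-style objects with name and value fields, and model.container_spec.ports expects a list of objects with a containerPort field, while the component serialized a flat dict and bare integers. A sketch of the two shapes as plain Python data (no GCP calls; values taken from the pipeline above):

```python
# Shape produced from the component's dict/int arguments (rejected by the API)
rejected_env = {"MODEL_BUCKET": "ml_model_bucket/model"}  # flat mapping
rejected_ports = [5000]                                   # bare integers

# Shape the v1 Model API expects (Kubernetes-style EnvVar / Port messages)
expected_env = [{"name": "MODEL_BUCKET", "value": "ml_model_bucket/model"}]
expected_ports = [{"containerPort": 5000}]
```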

After researching the issue for a while, I stumbled upon this GitHub issue. The problem originated in a mismatch between the google_cloud_pipeline_components and the Kubernetes API documentation. In this case serving_container_environment_variables is typed as Optional[dict[str, str]] whereas it should be typed as Optional[list[dict[str, str]]]. A similar mismatch holds for the serving_container_ports argument. Passing the arguments following the Kubernetes documentation did the trick:

model_upload_op = gcc_aip.ModelUploadOp(
    project="my-project",
    location="us-west1",
    display_name="session_model",
    serving_container_image_uri="gcr.io/my-project/pred:latest",
    serving_container_environment_variables=[
        {"name": "MODEL_BUCKET", "value": "ml_session_model"}
    ],
    serving_container_ports=[{"containerPort": 5000}],
    serving_container_predict_route="/predict",
    serving_container_health_route="/health",
)
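If other parts of the pipeline keep flat dicts (as environment_variables in the training op above does), a small helper can convert them into the Kubernetes-style lists. Note that to_env_list and to_port_list below are hypothetical names of my own, not part of any library:

```python
def to_env_list(env: dict) -> list:
    """Convert a flat {"NAME": "value"} mapping into [{"name": ..., "value": ...}]."""
    return [{"name": name, "value": value} for name, value in env.items()]


def to_port_list(ports: list) -> list:
    """Convert bare port numbers like [5000] into [{"containerPort": 5000}]."""
    return [{"containerPort": port} for port in ports]
```

With these, the upload op could take serving_container_environment_variables=to_env_list({"MODEL_BUCKET": "ml_session_model"}) and serving_container_ports=to_port_list([5000]).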