Force Google Cloud Run containers to be scheduled on nodes with a GPU

Question

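Is there a way to force services deployed with Google Cloud Run for Anthos (hosted on GKE) to be scheduled onto a node pool that has GPUs?

I created a Kubernetes cluster via Kubernetes -> Create Cluster -> GPU Accelerated Computing. This produced a cluster with a gpu-pool-1 node pool, whose nodes have GPUs, and a standard-pool-1 node pool, whose nodes do not.

Is there a way to deploy Cloud Run containers to the nodes with GPUs? Perhaps by configuring a custom namespace or something similar?

Note that there is a similar question from almost a year ago, but I don't think its accepted answer ("Cloud Run on Kubernetes does not support GPUs") is entirely correct.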

Answer 1

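This is a hot topic in Knative Serving development.

When your pods are created by Knative Serving, it is currently not possible to set node selectors or tolerations on them, but the team is working on a solution.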

Answer 2

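There does seem to be a way to make this work, at least in a hacky way, as described here.

The Knative Service configuration does appear to accept and honor a limits: nvidia.com/gpu: 1 parameter. The Cloud Run interface does not let us specify this parameter ourselves, but we can use the kubectl CLI to manually deploy a Knative service defined by a yaml file that includes it.

First, we need a GKE cluster with a CPU node pool, a GPU node pool, and Cloud Run for Anthos enabled. This can be done by going to Kubernetes Engine -> Create Cluster, selecting "GPU Accelerated Computing" in the cluster templates bar on the left, and checking "Enable Cloud Run for Anthos". Once the cluster is created, click the "Connect" button and start Cloud Shell. There we can create a service.yaml file that defines our Knative service. For example, we can adapt the service.yaml file from the Knative documentation, but specify that the service requires a GPU: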

# service.yaml
apiVersion: serving.knative.dev/v1 # Current version of Knative
kind: Service
metadata:
  name: helloworld-go # The name of the app
  namespace: default # The namespace the app will use
spec:
  template:
    spec:
      containers:
        - image: gcr.io/knative-samples/helloworld-go # The URL to the image of the app
          env:
            - name: TARGET # The environment variable printed out by the sample app
              value: "Go Sample v1"
          resources:
            limits:
              nvidia.com/gpu: 1 # The service must be run on a machine with at least one GPU
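
Before deploying, it may be worth confirming that the GPU nodes actually advertise the nvidia.com/gpu resource, since a pod that requests it cannot be scheduled otherwise. The following is a minimal sketch, not part of the original answer: the function name list_gpu_nodes is hypothetical, and it assumes the standard GKE node label cloud.google.com/gke-nodepool.

```shell
# Hypothetical helper: list every node with its GKE node pool and the number
# of GPUs it advertises as allocatable. Nodes without GPUs show <none>.
# Assumes kubectl is already pointed at the cluster (e.g. from Cloud Shell).
list_gpu_nodes() {
  kubectl get nodes \
    -o 'custom-columns=NAME:.metadata.name,POOL:.metadata.labels.cloud\.google\.com/gke-nodepool,GPU:.status.allocatable.nvidia\.com/gpu'
}
```

Running list_gpu_nodes should show a GPU count for the nodes in gpu-pool-1 and <none> for the nodes in standard-pool-1. If the GPU column is empty everywhere, the GPU drivers are probably not installed on the nodes yet.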

We can deploy this service with:

kubectl apply -f service.yaml

And check its status with:

kubectl get ksvc helloworld-go

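Because only GPU nodes advertise the nvidia.com/gpu resource, the helloworld-go service can only be scheduled on nodes that have a GPU. The service should show up in the Cloud Run dashboard like any other Cloud Run for Anthos service.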
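
To double-check where the pods actually landed, something like the following sketch could be used. The function name ksvc_pod_nodes is hypothetical. It relies on the serving.knative.dev/service label that Knative Serving puts on the pods it creates, and it assumes the service lives in the default namespace unless told otherwise.

```shell
# Hypothetical helper: print each pod of a Knative service and the node it
# runs on.
#   $1: Knative service name
#   $2: namespace (optional, defaults to "default")
ksvc_pod_nodes() {
  kubectl get pods -n "${2:-default}" \
    -l "serving.knative.dev/service=$1" \
    -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName
}
```

For example, ksvc_pod_nodes helloworld-go should list nodes whose names contain gpu-pool-1. Knative may scale an idle service down to zero, so the list can be empty until the service receives a request.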


google-kubernetes-engine

knative-serving

google-cloud-run
