Install GPU Driver on autoscaling Node in GKE (Cloud Composer)

Question

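I am running a Google Cloud Composer GKE cluster. I have a default node pool with 3 normal CPU nodes and one node pool with a GPU node. The GPU node pool has autoscaling activated.

I want to run a script inside a Docker container on that GPU node.

For the GPU operating system I decided to go with cos_containerd instead of ubuntu.

I followed https://cloud.google.com/kubernetes-engine/docs/how-to/gpus and ran this line: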

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

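When I run "kubectl describe" on the GPU node, the GPU now shows up, but my test script's debug output tells me that the GPU is not being used.

When I connect to the autoprovisioned GPU node via ssh, I can see that I still need to run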

cos-extensions gpu install

in order to use the GPU.

I now want my Cloud Composer GKE cluster to run "cos-extensions gpu install" whenever the autoscaler creates a node.

I would like to apply something like this YAML:

#cloud-config

runcmd:
  - cos-extensions install gpu

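to my Cloud Composer GKE cluster.

Can I do that with kubectl apply? Ideally I would like to run that YAML code only on the GPU node. How can I achieve that?

I am new to Kubernetes and have already spent a lot of time on this without success. Any help would be much appreciated.

Best, Phil

UPDATE: OK, thanks to Harsh I realized I have to go via a DaemonSet + ConfigMap like here: https://github.com/GoogleCloudPlatform/solutions-gke-init-daemonsets-tutorial

My GPU node has the label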

gpu-type=t4

So I have created and kubectl applied this ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: phils-init-script
  labels:
    gpu-type: t4
data:
  entrypoint.sh: |
    #!/usr/bin/env bash

    ROOT_MOUNT_DIR="${ROOT_MOUNT_DIR:-/root}"

    chroot "${ROOT_MOUNT_DIR}" cos-extensions gpu install

And here is my DaemonSet (I also kubectl applied this one):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: phils-cos-extensions-gpu-installer
  labels:
    gpu-type: t4
spec:
  selector:
    matchLabels:
      gpu-type: t4
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: phils-cos-extensions-gpu-installer
        gpu-type: t4
    spec:
      volumes:
      - name: root-mount
        hostPath:
          path: /
      - name: phils-init-script
        configMap:
          name: phils-init-script
          defaultMode: 0744
      initContainers:
      - image: ubuntu:18.04
        name: phils-cos-extensions-gpu-installer
        command: ["/scripts/entrypoint.sh"]
        env:
        - name: ROOT_MOUNT_DIR
          value: /root
        securityContext:
          privileged: true
        volumeMounts:
        - name: root-mount
          mountPath: /root
        - name: phils-init-script
          mountPath: /scripts
      containers:
      - image: "gcr.io/google-containers/pause:2.0"
        name: pause

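But nothing happens, and I get the message "Pods are pending".

While the script is running I connect to the GPU node via ssh and can see that the ConfigMap shell code did not get applied.

What am I missing here?

I am desperately trying to make this work.

Best, Phil

Thanks for all your help so far!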

Answer 1

Can I do that with kubectl apply? Ideally I would like to run that YAML code only on the GPU node. How can I achieve that?

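Yes, you can run a DaemonSet, which places a pod on each node and runs the command on that node.

Because you are on GKE, the DaemonSet will also run the command or script on new nodes that are added as the pool scales up.

A DaemonSet is mainly used to deploy an application on every available node in the cluster.

We can leverage this and use a DaemonSet to run the command on every node, including nodes that join later.

Example YAML: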

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-initializer
  labels:
    app: default-init
spec:
  selector:
    matchLabels:
      app: default-init
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: node-initializer
        app: default-init
    spec:
      volumes:
      - name: root-mount
        hostPath:
          path: /
      - name: entrypoint
        configMap:
          name: entrypoint
          defaultMode: 0744
      initContainers:
      - image: ubuntu:18.04
        name: node-initializer
        command: ["/scripts/entrypoint.sh"]
        env:
        - name: ROOT_MOUNT_DIR
          value: /root
        securityContext:
          privileged: true
        volumeMounts:
        - name: root-mount
          mountPath: /root
        - name: entrypoint
          mountPath: /scripts
      containers:
      - image: "gcr.io/google-containers/pause:2.0"
        name: pause
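
The DaemonSet above mounts a ConfigMap named entrypoint at /scripts, but the manifest for that ConfigMap is not shown. A minimal sketch of how it could be created for the GPU case follows. It assumes the command from the question's cloud-config (cos-extensions install gpu) is what should run on the host. The file name entrypoint.sh matches the command in the DaemonSet, and the APPLY variable is only a hypothetical switch so the script can be run without a cluster.

```shell
#!/bin/sh
# Sketch: write the init script that the DaemonSet's init container runs.
cat > entrypoint.sh <<'EOF'
#!/usr/bin/env bash
set -eu

# The DaemonSet mounts the node's root filesystem at this path.
ROOT_MOUNT_DIR="${ROOT_MOUNT_DIR:-/root}"

# Run the installer inside the host filesystem (command taken from the question).
chroot "${ROOT_MOUNT_DIR}" cos-extensions install gpu
EOF

# Create the ConfigMap only when explicitly requested (APPLY is a hypothetical switch).
if [ "${APPLY:-0}" = "1" ]; then
  kubectl create configmap entrypoint --from-file=entrypoint.sh
fi
```

Running it with APPLY=1 creates the ConfigMap in the current namespace, which should be the namespace the DaemonSet is deployed to.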

GitHub link for an example: https://github.com/GoogleCloudPlatform/solutions-gke-init-daemonsets-tutorial

Exact deployment steps: https://cloud.google.com/solutions/automatically-bootstrapping-gke-nodes-with-daemonsets#deploying_the_daemonset
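
Once the DaemonSet has been deployed, one way to see whether its pods were scheduled and what the init container printed is sketched below. The selector app=default-init and the container name node-initializer come from the example YAML. The helper name show_init_status and the KUBECTL variable are hypothetical; the variable exists only so the function can be dry-run without a cluster.

```shell
# kubectl binary to use; override for a dry run (hypothetical convenience variable).
KUBECTL="${KUBECTL:-kubectl}"

# show_init_status SELECTOR INIT_CONTAINER
show_init_status() {
  selector="$1"
  container="$2"
  # Which nodes received a pod, and what status does each pod report?
  "$KUBECTL" get pods -l "$selector" -o wide
  # The events at the end of the description explain why a pod stays Pending.
  "$KUBECTL" describe pods -l "$selector"
  # Output of the init script itself.
  "$KUBECTL" logs -l "$selector" -c "$container" --tail=50
}
```

For the example above the call would be: show_init_status app=default-init node-initializer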

Full article: https://cloud.google.com/solutions/automatically-bootstrapping-gke-nodes-with-daemonsets
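
To run the initializer only on the GPU nodes, as the question asks, the DaemonSet's pod template can carry a node selector and a toleration. The sketch below assumes the node label gpu-type=t4 from the question and the nvidia.com/gpu taint that GKE places on GPU node pools. The patch file name and the APPLY switch are hypothetical.

```shell
# Sketch: a strategic merge patch for the pod template of the example DaemonSet.
cat > gpu-only-patch.yaml <<'EOF'
spec:
  template:
    spec:
      # Schedule only onto nodes carrying the label from the question.
      nodeSelector:
        gpu-type: t4
      # Allow scheduling onto nodes tainted with nvidia.com/gpu.
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
EOF

# Patch the DaemonSet only when explicitly requested (APPLY is a hypothetical switch).
if [ "${APPLY:-0}" = "1" ]; then
  kubectl patch daemonset node-initializer --patch-file gpu-only-patch.yaml
fi
```

Without a matching toleration, a DaemonSet pod is not scheduled onto a tainted GPU node, so this is worth checking when the init script never seems to run there.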

Answer 2

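If you have installed the driver many times and nvidia-smi still fails to communicate, take a look at prime-select.

Run prime-select query; this shows you all the possible options, and it must list at least nvidia | intel.
Select prime-select nvidia.
Then, if you see nvidia is already selected, select a different one, e.g. prime-select intel. Next, switch back to nvidia with prime-select nvidia.
Reboot and check nvidia-smi.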
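
The prime-select steps above can be sketched as a small shell helper. This assumes an Ubuntu machine with the nvidia-prime package installed, where prime-select query prints the name of the current profile. The function name reselect_nvidia and the SUDO variable are hypothetical.

```shell
# Privilege wrapper; override for a dry run (hypothetical convenience variable).
SUDO="${SUDO:-sudo}"

# reselect_nvidia CURRENT_PROFILE
# CURRENT_PROFILE is expected to be the output of `prime-select query`.
reselect_nvidia() {
  current="$1"
  # If nvidia is already selected, switch away first so the re-selection takes effect.
  if [ "$current" = "nvidia" ]; then
    $SUDO prime-select intel
  fi
  $SUDO prime-select nvidia
}
```

A possible invocation is reselect_nvidia "$(prime-select query)", followed by a reboot and a check with nvidia-smi.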

It may also be a good idea to run this again:

sudo apt install nvidia-cuda-toolkit

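Once that is done, reboot the machine, and nvidia-smi should then work.

In other cases, it works to install cuDNN and CUDA on the VM by following these instructions: cuda_11.2_installation_on_Ubuntu_20.04.

Finally, in some other cases the problem is caused by unattended-upgrades. Take a look at the settings and adjust them if they are causing unexpected results. This URL has the documentation for Debian, and I see you have already tested with that distribution: UnattendedUpgrades.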

google-compute-engine

google-cloud-platform

kubernetes

google-kubernetes-engine

google-cloud-composer