Scheduling GPUs in Kubernetes 1.10

I followed the instructions to install nvidia-docker 2 and then installed Kubernetes 1.10 via kubeadm (on RHEL7). Here is what I did:

curl -s -L https://nvidia.github.io/nvidia-docker/rhel7.4/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
yum update

yum install docker

yum install -y nvidia-container-runtime-hook

yum install --downloadonly --downloaddir=/tmp/  nvidia-docker2-2.0.3-1.docker1.13.1.noarch nvidia-container-runtime-2.0.0-1.docker1.13.1.x86_64
rpm -Uhv --replacefiles /tmp/nvidia-container-runtime-2.0.0-1.docker1.13.1.x86_64.rpm /tmp/nvidia-docker2-2.0.3-1.docker1.13.1.noarch.rpm

mkdir -p  /etc/systemd/system/docker.service.d/
cat <<'EOF' > /etc/systemd/system/docker.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd-current --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current --seccomp-profile=/etc/docker/seccomp.json $OPTIONS $DOCKER_STORAGE_OPTIONS $DOCKER_NETWORK_OPTIONS $ADD_REGISTRY $BLOCK_REGISTRY $INSECURE_REGISTRY $REGISTRIES
EOF

cat <<EOF > /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF

systemctl restart docker

docker run --rm nvidia/cuda nvidia-smi
# success!

I can even run GPU containers and see all of the GPUs from inside the container.

However, when I deploy a container with:

resources:
    limits:
        nvidia.com/gpu: 1

the pod stays in Pending:

jupyter         jupyterlab-gpu                 0/1       Pending     0          1m        <none>           <none>

kubectl describe shows:

Name:         jupyterlab-gpu
Namespace:    jupyter
Node:         <none>
Labels:       app=jupyterhub
              component=singleuser-server
              heritage=jupyterhub
              hub.jupyter.org/username=me
Annotations:  <none>
Status:       Pending
IP:
Containers:
  notebook:
    Image:      slaclab/slac-jupyterlab-gpu
    Port:       8888/TCP
    Host Port:  0/TCP
    Limits:
      cpu:             2
      memory:          2147483648
      nvidia.com/gpu:  1
    Requests:
      cpu:             500m
      memory:          536870912
      nvidia.com/gpu:  1
    Environment:
      JUPYTERHUB_USER:                me
      JUPYTERLAB_IDLE_TIMEOUT:        43200
      JPY_API_TOKEN:                  1fca7b3d716e4d54a98d8054d17b16fb
      CPU_LIMIT:                      2.0
      JUPYTERHUB_SERVICE_PREFIX:      /user/me/
      MEM_GUARANTEE:                  536870912
      JUPYTERHUB_API_URL:             http://10.103.19.59:8081/hub/api
      JUPYTERHUB_OAUTH_CALLBACK_URL:  /user/me/oauth_callback
      JUPYTERHUB_BASE_URL:            /
      JUPYTERHUB_API_TOKEN:           1fca7b3d716e4d54a98d8054d17b16fb
      CPU_GUARANTEE:                  0.5
      JUPYTERHUB_CLIENT_ID:           user-me
      MEM_LIMIT:                      2147483648
      JUPYTERHUB_HOST:
    Mounts:
      /home/ from generic-user-home (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from no-api-access-please (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  generic-user-home:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  generic-user-home
    ReadOnly:   false
  no-api-access-please:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
QoS Class:       Burstable
Node-Selectors:  group=gpu
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  14s (x13 over 2m)  default-scheduler  0/8 nodes are available: 1 node(s) were not ready, 6 node(s) didn't match node selector, 7 Insufficient nvidia.com/gpu.

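I can schedule containers onto the nodes without any problem as long as they do not set the GPU resource limit.

Is there a way to verify that kubectl (?) can 'see' the GPUs?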

You can inspect the node details with kubectl get nodes -o yaml. If a node is advertising its GPUs, the nvidia.com/gpu resource is listed under status.allocatable and status.capacity, alongside cpu and memory.
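
For a quicker overview than reading the full YAML, the same check can be sketched as a per-node table. This is only an illustration: the function names list_node_gpus and any_node_has_gpu are made up for this example, and it assumes kubectl is already configured against the cluster. A node that does not advertise the resource shows <none> in the GPU column.

```shell
#!/usr/bin/env bash

# Print one row per node with the GPU count it reports as allocatable.
# The dot inside the resource name must be escaped in the column expression.
list_node_gpus() {
    kubectl get nodes \
        -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Return success only if at least one node reports a non-zero GPU count.
# Rows whose GPU column is <none> (resource not advertised) are ignored.
any_node_has_gpu() {
    list_node_gpus |
        awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { found = 1 } END { exit !found }'
}
```

After sourcing these functions, running list_node_gpus prints the table, and any_node_has_gpu || echo "no node advertises nvidia.com/gpu" gives a yes/no answer. If the GPU nodes show <none>, the scheduler has nothing to allocate, which would match the "Insufficient nvidia.com/gpu" event in the question. In Kubernetes 1.10 this resource is normally registered by the NVIDIA device plugin running on each GPU node, and the steps in the question do not show it being deployed, so that may be worth checking.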