Scheduling GPUs in Kubernetes 1.10
I installed nvidia-docker 2 following the instructions, then installed Kubernetes 1.10 via kubeadm (on RHEL7). Specifically, I did the following:
curl -s -L https://nvidia.github.io/nvidia-docker/rhel7.4/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
yum update
yum install docker
yum install -y nvidia-container-runtime-hook
yum install --downloadonly --downloaddir=/tmp/ nvidia-docker2-2.0.3-1.docker1.13.1.noarch nvidia-container-runtime-2.0.0-1.docker1.13.1.x86_64
rpm -Uhv --replacefiles /tmp/nvidia-container-runtime-2.0.0-1.docker1.13.1.x86_64.rpm /tmp/nvidia-docker2-2.0.3-1.docker1.13.1.noarch.rpm
mkdir -p /etc/systemd/system/docker.service.d/
cat <<EOF > /etc/systemd/system/docker.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd-current --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current --seccomp-profile=/etc/docker/seccomp.json $OPTIONS $DOCKER_STORAGE_OPTIONS $DOCKER_NETWORK_OPTIONS $ADD_REGISTRY $BLOCK_REGISTRY $INSECURE_REGISTRY $REGISTRIES
EOF
cat <<EOF > /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
systemctl restart docker
docker run --rm nvidia/cuda nvidia-smi
# success!
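As a quick sanity check (not part of the install instructions above, just standard docker tooling), docker info should now report nvidia among the configured runtimes and as the default runtime:

# expect lines like "Runtimes: ... nvidia ..." and "Default Runtime: nvidia"
docker info | grep -i runtime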
I can even run GPU containers and see all of the GPUs from inside the container.
However, when I deploy a container with:
resources:
  limits:
    nvidia.com/gpu: 1
the pod just stays at:
jupyter jupyterlab-gpu 0/1 Pending 0 1m <none> <none>
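For reference, the same limit can be exercised outside of JupyterHub with a minimal standalone test pod; the pod name and image below are just placeholders, not part of the actual deployment:

# hypothetical minimal pod that only exercises the nvidia.com/gpu limit
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF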
Describing the jupyterlab-gpu pod shows:
Name: jupyterlab-gpu
Namespace: jupyter
Node: <none>
Labels: app=jupyterhub
component=singleuser-server
heritage=jupyterhub
hub.jupyter.org/username=me
Annotations: <none>
Status: Pending
IP:
Containers:
notebook:
Image: slaclab/slac-jupyterlab-gpu
Port: 8888/TCP
Host Port: 0/TCP
Limits:
cpu: 2
memory: 2147483648
nvidia.com/gpu: 1
Requests:
cpu: 500m
memory: 536870912
nvidia.com/gpu: 1
Environment:
JUPYTERHUB_USER: me
JUPYTERLAB_IDLE_TIMEOUT: 43200
JPY_API_TOKEN: 1fca7b3d716e4d54a98d8054d17b16fb
CPU_LIMIT: 2.0
JUPYTERHUB_SERVICE_PREFIX: /user/me/
MEM_GUARANTEE: 536870912
JUPYTERHUB_API_URL: http://10.103.19.59:8081/hub/api
JUPYTERHUB_OAUTH_CALLBACK_URL: /user/me/oauth_callback
JUPYTERHUB_BASE_URL: /
JUPYTERHUB_API_TOKEN: 1fca7b3d716e4d54a98d8054d17b16fb
CPU_GUARANTEE: 0.5
JUPYTERHUB_CLIENT_ID: user-me
MEM_LIMIT: 2147483648
JUPYTERHUB_HOST:
Mounts:
/home/ from generic-user-home (rw)
/var/run/secrets/kubernetes.io/serviceaccount from no-api-access-please (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
generic-user-home:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: generic-user-home
ReadOnly: false
no-api-access-please:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
QoS Class: Burstable
Node-Selectors: group=gpu
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 14s (x13 over 2m) default-scheduler 0/8 nodes are available: 1 node(s) were not ready, 6 node(s) didn't match node selector, 7 Insufficient nvidia.com/gpu.
I am able to schedule containers onto the node without any problem as long as they have no GPU resource limit.
Is there a way to verify that kubectl (?) can 'see' the GPUs?
You can view the node details with kubectl get nodes -oyaml; the nvidia.com/gpu resource will be listed under status.allocatable and status.capacity, alongside cpu and memory.
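For example (these exact grep/custom-columns invocations are just one way to pull the field out, using standard kubectl output options):

# print the allocatable block of every node; nvidia.com/gpu should appear here on GPU nodes
kubectl get nodes -oyaml | grep -A 10 'allocatable:'

# or one row per node; the GPU column shows <none> on nodes that don't advertise the resource
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"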