GPUs not showing up on GKE Nodes even though they show up in the GKE NodePool
I'm trying to set up a Google Kubernetes Engine cluster with GPUs on the nodes, loosely following these instructions, because I'm deploying programmatically with the Python client.
For some reason I can create the cluster with a NodePool that contains GPUs...
...but the nodes in the NodePool don't have access to those GPUs.
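For context, creating the GPU node pool with the google-cloud-container Python client looks roughly like the sketch below (simplified; the project/zone/cluster names and the accelerator type are placeholders, not my real values):
# Sketch: create a node pool with attached GPUs via the Python client.
from google.cloud import container_v1

client = container_v1.ClusterManagerClient()

node_pool = container_v1.NodePool(
    name="gpu-pool",
    initial_node_count=1,
    config=container_v1.NodeConfig(
        machine_type="n1-standard-4",
        # Scope needed to pull images from gcr.io.
        oauth_scopes=["https://www.googleapis.com/auth/devstorage.read_only"],
        accelerators=[
            container_v1.AcceleratorConfig(
                accelerator_count=1,
                accelerator_type="nvidia-tesla-t4",  # placeholder GPU type
            )
        ],
    ),
)

op = client.create_node_pool(
    parent="projects/PROJECT-NAME/locations/ZONE/clusters/CLUSTER-NAME",
    node_pool=node_pool,
)
print(op.name)  # long-running operation id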
I've already installed the NVIDIA DaemonSet with this yaml file:
https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
You can see it in this screenshot:
For some reason those two rows always seem to be stuck in the "ContainerCreating" and "PodInitializing" states. They never flip to green with Status = "Running". How do I make the GPUs in the NodePool available to the nodes?
Update:
Per the comments, I ran the following on the 2 NVIDIA pods: kubectl describe pod POD_NAME --namespace kube-system.
To do this, I opened the kubectl command terminal from the UI for the node. Then I ran the following command:
gcloud container clusters get-credentials CLUSTER-NAME --zone ZONE --project PROJECT-NAME
Then I ran kubectl describe pod nvidia-gpu-device-plugin-UID --namespace kube-system
and got this output:
Name: nvidia-gpu-device-plugin-UID
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Node: gke-mycluster-clust-default-pool-26403abb-zqz6/X.X.X.X
Start Time: Wed, 02 Mar 2022 20:19:49 +0000
Labels: controller-revision-hash=79765599fc
k8s-app=nvidia-gpu-device-plugin
pod-template-generation=1
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: DaemonSet/nvidia-gpu-device-plugin
Containers:
nvidia-gpu-device-plugin:
Container ID:
Image: gcr.io/gke-release/nvidia-gpu-device-plugin@sha256:aa80c85c274a8e8f78110cae33cc92240d2f9b7efb3f53212f1cefd03de3c317
Image ID:
Port: 2112/TCP
Host Port: 0/TCP
Command:
/usr/bin/nvidia-gpu-device-plugin
-logtostderr
--enable-container-gpu-metrics
--enable-health-monitoring
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
cpu: 50m
memory: 50Mi
Requests:
cpu: 50m
memory: 20Mi
Environment:
LD_LIBRARY_PATH: /usr/local/nvidia/lib64
Mounts:
/dev from dev (rw)
/device-plugin from device-plugin (rw)
/etc/nvidia from nvidia-config (rw)
/proc from proc (rw)
/usr/local/nvidia from nvidia (rw)
/var/lib/kubelet/pod-resources from pod-resources (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-qnxjr (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
device-plugin:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/device-plugins
HostPathType:
dev:
Type: HostPath (bare host directory volume)
Path: /dev
HostPathType:
nvidia:
Type: HostPath (bare host directory volume)
Path: /home/kubernetes/bin/nvidia
HostPathType: Directory
pod-resources:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/pod-resources
HostPathType:
proc:
Type: HostPath (bare host directory volume)
Path: /proc
HostPathType:
nvidia-config:
Type: HostPath (bare host directory volume)
Path: /etc/nvidia
HostPathType:
default-token-qnxjr:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-qnxjr
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: :NoExecute op=Exists
:NoSchedule op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m55s default-scheduler Successfully assigned kube-system/nvidia-gpu-device-plugin-hxdwx to gke-opcode-trainer-clust-default-pool-26403abb-zqz6
Warning FailedMount 6m42s kubelet Unable to attach or mount volumes: unmounted volumes=[nvidia], unattached volumes=[nvidia-config default-token-qnxjr device-plugin dev nvidia pod-resources proc]: timed out waiting for the condition
Warning FailedMount 4m25s kubelet Unable to attach or mount volumes: unmounted volumes=[nvidia], unattached volumes=[proc nvidia-config default-token-qnxjr device-plugin dev nvidia pod-resources]: timed out waiting for the condition
Warning FailedMount 2m11s kubelet Unable to attach or mount volumes: unmounted volumes=[nvidia], unattached volumes=[pod-resources proc nvidia-config default-token-qnxjr device-plugin dev nvidia]: timed out waiting for the condition
Warning FailedMount 31s (x12 over 8m45s) kubelet MountVolume.SetUp failed for volume "nvidia" : hostPath type check failed: /home/kubernetes/bin/nvidia is not a directory
Then I ran kubectl describe pod nvidia-driver-installer-UID --namespace kube-system
and got this output:
Name: nvidia-driver-installer-UID
Namespace: kube-system
Priority: 0
Node: gke-mycluster-clust-default-pool-26403abb-zqz6/X.X.X.X
Start Time: Wed, 02 Mar 2022 20:20:06 +0000
Labels: controller-revision-hash=6bbfc44f6d
k8s-app=nvidia-driver-installer
name=nvidia-driver-installer
pod-template-generation=1
Annotations: <none>
Status: Pending
IP: 10.56.0.9
IPs:
IP: 10.56.0.9
Controlled By: DaemonSet/nvidia-driver-installer
Init Containers:
nvidia-driver-installer:
Container ID:
Image: gke-nvidia-installer:fixed
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: ImagePullBackOff
Ready: False
Restart Count: 0
Requests:
cpu: 150m
Environment: <none>
Mounts:
/boot from boot (rw)
/dev from dev (rw)
/root from root-mount (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-qnxjr (ro)
Containers:
pause:
Container ID:
Image: gcr.io/google-containers/pause:2.0
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-qnxjr (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
dev:
Type: HostPath (bare host directory volume)
Path: /dev
HostPathType:
boot:
Type: HostPath (bare host directory volume)
Path: /boot
HostPathType:
root-mount:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
default-token-qnxjr:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-qnxjr
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m20s default-scheduler Successfully assigned kube-system/nvidia-driver-installer-tzw42 to gke-opcode-trainer-clust-default-pool-26403abb-zqz6
Normal Pulling 2m36s (x4 over 4m19s) kubelet Pulling image "gke-nvidia-installer:fixed"
Warning Failed 2m34s (x4 over 4m10s) kubelet Failed to pull image "gke-nvidia-installer:fixed": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/gke-nvidia-installer:fixed": failed to resolve reference "docker.io/library/gke-nvidia-installer:fixed": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed
Warning Failed 2m34s (x4 over 4m10s) kubelet Error: ErrImagePull
Warning Failed 2m22s (x6 over 4m9s) kubelet Error: ImagePullBackOff
Normal BackOff 2m7s (x7 over 4m9s) kubelet Back-off pulling image "gke-nvidia-installer:fixed"
Based on the docker image the container is trying to pull (gke-nvidia-installer:fixed), it looks like you're trying to use the Ubuntu daemonset rather than the cos one. (That would also explain the FailedMount errors on the device-plugin pod: /home/kubernetes/bin/nvidia only exists once the driver installer has actually run on the node.)
You should run kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
This will apply the correct daemonset for your cos node pool, as described here.
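Once the installer and device-plugin pods go Running, the nodes should start advertising nvidia.com/gpu as an allocatable resource. As a quick check, here's a sketch using the official kubernetes Python client (assuming your kubeconfig already points at the cluster, e.g. via gcloud get-credentials):
# Sketch: confirm each node now exposes nvidia.com/gpu.
from kubernetes import client, config

config.load_kube_config()  # reuses the credentials gcloud wrote
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    gpus = node.status.allocatable.get("nvidia.com/gpu", "0")
    print(node.metadata.name, "nvidia.com/gpu =", gpus)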
Also, verify that your node pool has the https://www.googleapis.com/auth/devstorage.read_only scope, which is needed to pull the image. You should be able to see it on the node pool page in the GCP console, under Security -> Access scopes (the relevant service is Storage).
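Since you're deploying with the Python client anyway, you can also check the scopes programmatically instead of through the console; a sketch with google-cloud-container (the resource name components are placeholders):
# Sketch: read back the node pool's access scopes.
from google.cloud import container_v1

client = container_v1.ClusterManagerClient()
pool = client.get_node_pool(
    name="projects/PROJECT-NAME/locations/ZONE/clusters/CLUSTER-NAME/nodePools/default-pool",
)
print(pool.config.oauth_scopes)  # should include .../devstorage.read_only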