On GKE, the dcgm-exporter pod fails to run if the nvidia.com/gpu resource is not allocated
I am trying to query GPU usage metrics for GKE pods.
Here is what I did to test:
- Created a GKE cluster with two node pools: one with two CPU-only nodes and one with a single node that has an NVIDIA Tesla T4 GPU. All nodes run Container-Optimized OS.
- As described in https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers, I ran
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
- Ran
kubectl create -f dcgm-exporter.yaml
with the following manifest:
# dcgm-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.1.1"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.1.1"
      name: "dcgm-exporter"
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      containers:
      - image: "nvidia/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"
        # resources:
        #   limits:
        #     nvidia.com/gpu: "1"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
          capabilities:
            add: ["SYS_ADMIN"]
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
      tolerations:
      - effect: "NoExecute"
        operator: "Exists"
      - effect: "NoSchedule"
        operator: "Exists"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
---
kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '9400'
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
  ports:
  - name: "metrics"
    port: 9400
- The pod runs only on the GPU node, but it crashes with the following error:
time="2020-11-21T04:27:21Z" level=info msg="Starting dcgm-exporter"
Error: Failed to initialize NVML
time="2020-11-21T04:27:21Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"
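If I uncomment resources: limits: nvidia.com/gpu: "1", it runs successfully. However, I don't want this pod to occupy a GPU; it should only monitor them.
How can I run dcgm-exporter without allocating a GPU? I also tried Ubuntu nodes, but that failed too.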
It works with these two changes:
- Set privileged: true in securityContext.
- Add a volume mount for "nvidia-install-dir-host".
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.1.1"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.1.1"
      name: "dcgm-exporter"
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      containers:
      - image: "nvidia/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        securityContext:
          privileged: true
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
        - name: "nvidia-install-dir-host"
          mountPath: "/usr/local/nvidia"
      tolerations:
      - effect: "NoExecute"
        operator: "Exists"
      - effect: "NoSchedule"
        operator: "Exists"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
      - name: "nvidia-install-dir-host"
        hostPath:
          path: "/home/kubernetes/bin/nvidia"
---
kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '9400'
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
  ports:
  - name: "metrics"
    port: 9400
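To confirm that the exporter is serving metrics without a GPU allocated, something like the following could be used. This is only a sketch: the function names gpu_util_samples and check_exporter are hypothetical, it assumes the DaemonSet and Service above are deployed in the current namespace, that kubectl and curl are available, and that the DCGM_FI_DEV_GPU_UTIL metric is enabled in the exporter's metric list.

```shell
#!/bin/sh
# Sketch: check that dcgm-exporter serves GPU metrics.
# Assumes the manifest above is applied in the current namespace.

# Keep only GPU utilization samples from Prometheus text read on stdin.
# Comment lines (# HELP / # TYPE) are dropped because they start with '#'.
# Assumption: DCGM_FI_DEV_GPU_UTIL is among the metrics the exporter emits.
gpu_util_samples() {
  grep '^DCGM_FI_DEV_GPU_UTIL' || true
}

# Hypothetical wrapper; needs kubectl access to the cluster and curl.
check_exporter() {
  # The pod should be Running on the GPU node only.
  kubectl get pods -l app.kubernetes.io/name=dcgm-exporter -o wide

  # Forward the Service port locally, scrape once, then stop forwarding.
  kubectl port-forward service/dcgm-exporter 9400:9400 >/dev/null 2>&1 &
  pf_pid=$!
  sleep 3
  curl -s http://localhost:9400/metrics | gpu_util_samples
  kill "$pf_pid"
}

# Run against a live cluster with: check_exporter
```

If the scrape prints one DCGM_FI_DEV_GPU_UTIL line per GPU, the exporter can reach NVML. No output suggests the metric is not enabled or the exporter is still failing to initialize.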