How to scale up GPU nodes with autoscaler on AWS
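I'd like to have an instance group that scales from 0 to x pods. I get Insufficient nvidia.com/gpu. Does anyone see what I'm doing wrong here? This is on Kubernetes v1.9.6 with autoscaler 1.1.2.
I have two instance groups, one with cpus, and a new one called gpus that I want to be able to scale down to 0 nodes, so kops edit ig gpus is: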
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-05-31T09:27:31Z
  labels:
    kops.k8s.io/cluster: ci.k8s.local
  name: gpus
spec:
  cloudLabels:
    instancegroup: gpus
    k8s.io/cluster-autoscaler/enabled: ""
  image: ami-4450543d
  kubelet:
    featureGates:
      DevicePlugins: "true"
  machineType: p2.xlarge
  maxPrice: "0.5"
  maxSize: 3
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: gpus
  role: Node
  rootVolumeOptimization: true
  subnets:
  - eu-west-1c
The autoscaler deployment has:
spec:
  containers:
  - command:
    - ./cluster-autoscaler
    - --v=4
    - --stderrthreshold=info
    - --cloud-provider=aws
    - --skip-nodes-with-local-storage=false
    - --nodes=0:3:gpus.ci.k8s.local
    env:
    - name: AWS_REGION
      value: eu-west-1
    image: k8s.gcr.io/cluster-autoscaler:v1.1.2
Now I try to deploy a simple GPU test:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: simple-gpu-test
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: "simplegputest"
    spec:
      containers:
      - name: "nvidia-smi-gpu"
        image: "nvidia/cuda:8.0-cudnn5-runtime"
        resources:
          limits:
            nvidia.com/gpu: 1 # requesting 1 GPU
        volumeMounts:
        - mountPath: /usr/local/nvidia
          name: nvidia
        command: [ "/bin/bash", "-c", "--" ]
        args: [ "while true; do nvidia-smi; sleep 5; done;" ]
      volumes:
      - hostPath:
          path: /usr/local/nvidia
        name: nvidia
I expect the instance group to go from 0 to 1, but the autoscaler logs show:
I0605 11:27:29.865576 1 scale_up.go:54] Pod default/simple-gpu-test-6f48d9555d-l9822 is unschedulable
I0605 11:27:29.961051 1 scale_up.go:86] Upcoming 0 nodes
I0605 11:27:30.005163 1 scale_up.go:146] Scale-up predicate failed: PodFitsResources predicate mismatch, cannot put default/simple-gpu-test-6f48d9555d-l9822 on template-node-for-gpus.ci.k8s.local-5829202798403814789, reason: Insufficient nvidia.com/gpu
I0605 11:27:30.005262 1 scale_up.go:175] No pod can fit to gpus.ci.k8s.local
I0605 11:27:30.005324 1 scale_up.go:180] No expansion options
I0605 11:27:30.005393 1 static_autoscaler.go:299] Calculating unneeded nodes
I0605 11:27:30.008919 1 factory.go:33] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"simple-gpu-test-6f48d9555d-l9822", UID:"3416d787-68b3-11e8-8e8f-0639a6e973b0", APIVersion:"v1", ResourceVersion:"12429157", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added)
I0605 11:27:30.031707 1 leaderelection.go:199] successfully renewed lease kube-system/cluster-autoscaler
When I start a node by setting the minimum to 1, I see that it has the capacity:
Capacity:
 cpu:             4
 memory:          62884036Ki
 nvidia.com/gpu:  1
 pods:            110
and labels:
Labels:  beta.kubernetes.io/arch=amd64
         beta.kubernetes.io/instance-type=p2.xlarge
         beta.kubernetes.io/os=linux
         failure-domain.beta.kubernetes.io/region=eu-west-1
         failure-domain.beta.kubernetes.io/zone=eu-west-1c
         kops.k8s.io/instancegroup=gpus
         kubernetes.io/role=node
         node-role.kubernetes.io/node=
         spot=true
The required tag is present on the AWS Auto Scaling group:
{
    "ResourceId": "gpus.ci.k8s.local",
    "ResourceType": "auto-scaling-group",
    "Key": "k8s.io/cluster-autoscaler/node-template/label/kops.k8s.io/instancegroup",
    "Value": "gpus",
    "PropagateAtLaunch": true
}
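Finally, when I set the minimum pool size to 1, it can scale from 1 to 3 automatically. It just doesn't go from 0 to 1.
Is there any way I can inspect the template to see why it doesn't have the resource?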
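Cluster Autoscaler is a standalone program that adjusts the size of a Kubernetes cluster to meet the current needs. It can manage GPU resources provided by the cloud provider in the same manner.
According to the Cluster Autoscaler documentation, on AWS it is possible to scale a node group down to 0 (and, obviously, up from 0), provided that all scale-down conditions are met.
Going back to your question: on AWS, if you are using nodeSelector, you need to tag the ASG with keys of the form "k8s.io/cluster-autoscaler/node-template/label/<label-name>", so that the autoscaler knows which labels a new node in the group will carry. Please note that Kubernetes and AWS GPU support require different labels.
For example, for a node label of foo=bar, you would tag the ASG with: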
{
    "ResourceType": "auto-scaling-group",
    "ResourceId": "foo.example.com",
    "PropagateAtLaunch": true,
    "Value": "bar",
    "Key": "k8s.io/cluster-autoscaler/node-template/label/foo"
}
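The tag above follows a fixed pattern, so it can be generated rather than typed by hand. The following is a minimal sketch, not part of Cluster Autoscaler or the AWS tooling: the function name `node_template_label_tag` is hypothetical, and only the key prefix and the field names are taken from the example above.

```python
import json

# Key prefix taken from the tag example above.
NODE_TEMPLATE_LABEL_PREFIX = "k8s.io/cluster-autoscaler/node-template/label/"


def node_template_label_tag(asg_name, label_key, label_value):
    """Build the ASG tag that declares one node label (hypothetical helper)."""
    return {
        "ResourceType": "auto-scaling-group",
        "ResourceId": asg_name,
        "PropagateAtLaunch": True,
        "Value": label_value,
        "Key": NODE_TEMPLATE_LABEL_PREFIX + label_key,
    }


if __name__ == "__main__":
    # Reproduces the foo=bar example for the ASG named foo.example.com.
    tag = node_template_label_tag("foo.example.com", "foo", "bar")
    print(json.dumps(tag, indent=4))
```

The printed JSON has the same fields as the example, and for the question's group the call would be `node_template_label_tag("gpus.ci.k8s.local", "kops.k8s.io/instancegroup", "gpus")`.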
Also make sure that the limit for the required instance type in your AWS account is non-zero.
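To see which node labels the ASG tags actually declare, you can extract them from the tag list and compare them with the pod's nodeSelector. This is only a sketch under assumptions: it mirrors the tag naming convention described above, not the autoscaler's real template construction, and the helper names `template_labels_from_tags` and `missing_selector_labels` are hypothetical. The input is assumed to be a list of tag objects in the shape shown in the question.

```python
NODE_TEMPLATE_LABEL_PREFIX = "k8s.io/cluster-autoscaler/node-template/label/"


def template_labels_from_tags(tags):
    """Return the node labels declared by node-template label tags."""
    labels = {}
    for tag in tags:
        key = tag.get("Key", "")
        if key.startswith(NODE_TEMPLATE_LABEL_PREFIX):
            # Strip the prefix to recover the plain node label name.
            labels[key[len(NODE_TEMPLATE_LABEL_PREFIX):]] = tag.get("Value", "")
    return labels


def missing_selector_labels(node_selector, tags):
    """Return nodeSelector entries that the tags do not declare with the same value."""
    declared = template_labels_from_tags(tags)
    return {k: v for k, v in node_selector.items() if declared.get(k) != v}


if __name__ == "__main__":
    # Tag copied from the question; the cloudLabel entry is added for contrast.
    asg_tags = [
        {
            "ResourceId": "gpus.ci.k8s.local",
            "ResourceType": "auto-scaling-group",
            "Key": "k8s.io/cluster-autoscaler/node-template/label/kops.k8s.io/instancegroup",
            "Value": "gpus",
            "PropagateAtLaunch": True,
        },
        {
            "ResourceId": "gpus.ci.k8s.local",
            "ResourceType": "auto-scaling-group",
            "Key": "k8s.io/cluster-autoscaler/enabled",
            "Value": "",
            "PropagateAtLaunch": True,
        },
    ]
    print(template_labels_from_tags(asg_tags))
    print(missing_selector_labels({"kops.k8s.io/instancegroup": "gpus"}, asg_tags))
```

An empty result from `missing_selector_labels` means every selector entry is covered by a tag. It says nothing about resources such as nvidia.com/gpu, which is what the log in the question complains about.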