在 Google Kubernetes Engine 上使用 Horizontal Pod Autoscaler 失败并显示:无法读取所有指标
Using Horizontal Pod Autoscaler on Google Kubernetes Engine fails with: Unable to read all metrics
我正在尝试设置 Horizontal Pod Autoscaler 以根据 CPU 使用情况自动扩大和缩小我的 api 服务器 pods。
我的 API 目前有 12 pods 运行,但他们正在使用 ~0% CPU.
kubectl get pods
NAME READY STATUS RESTARTS AGE
api-server-deployment-578f8d8649-4cbtc 2/2 Running 2 12h
api-server-deployment-578f8d8649-8cv77 2/2 Running 2 12h
api-server-deployment-578f8d8649-c8tv2 2/2 Running 1 12h
api-server-deployment-578f8d8649-d8c6r 2/2 Running 2 12h
api-server-deployment-578f8d8649-lvbgn 2/2 Running 1 12h
api-server-deployment-578f8d8649-lzjmj 2/2 Running 2 12h
api-server-deployment-578f8d8649-nztck 2/2 Running 1 12h
api-server-deployment-578f8d8649-q25xb 2/2 Running 2 12h
api-server-deployment-578f8d8649-tx75t 2/2 Running 1 12h
api-server-deployment-578f8d8649-wbzzh 2/2 Running 2 12h
api-server-deployment-578f8d8649-wtddv 2/2 Running 1 12h
api-server-deployment-578f8d8649-x95gq 2/2 Running 2 12h
model-server-deployment-76d466dffc-4g2nd 1/1 Running 0 23h
model-server-deployment-76d466dffc-9pqw5 1/1 Running 0 23h
model-server-deployment-76d466dffc-d29fx 1/1 Running 0 23h
model-server-deployment-76d466dffc-frrgn 1/1 Running 0 23h
model-server-deployment-76d466dffc-sfh45 1/1 Running 0 23h
model-server-deployment-76d466dffc-w2hqj 1/1 Running 0 23h
我的 api_hpa.yaml 看起来像:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
name: api-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api-server-deployment
minReplicas: 4
maxReplicas: 12
targetCPUUtilizationPercentage: 50
现在已经 24 小时了,HPA 仍然没有将我的 pods 缩小到 4,尽管没有看到 CPU 使用。
当我查看 GKE 部署详细信息仪表板时,我看到警告 Unable to read all metrics
这会导致自动缩放器无法缩减我的 pods 吗?
我该如何解决?
据我了解,GKE 会自动运行指标服务器:
kubectl get deployment --namespace=kube-system
NAME READY UP-TO-DATE AVAILABLE AGE
event-exporter-gke 1/1 1 1 18d
kube-dns 2/2 2 2 18d
kube-dns-autoscaler 1/1 1 1 18d
l7-default-backend 1/1 1 1 18d
metrics-server-v0.3.6 1/1 1 1 18d
stackdriver-metadata-agent-cluster-level 1/1 1 1 18d
这是该指标服务器的配置:
Name: metrics-server-v0.3.6
Namespace: kube-system
CreationTimestamp: Sun, 21 Feb 2021 11:20:55 -0800
Labels: addonmanager.kubernetes.io/mode=Reconcile
k8s-app=metrics-server
kubernetes.io/cluster-service=true
version=v0.3.6
Annotations: deployment.kubernetes.io/revision: 14
Selector: k8s-app=metrics-server,version=v0.3.6
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: k8s-app=metrics-server
version=v0.3.6
Annotations: seccomp.security.alpha.kubernetes.io/pod: docker/default
Service Account: metrics-server
Containers:
metrics-server:
Image: k8s.gcr.io/metrics-server-amd64:v0.3.6
Port: 443/TCP
Host Port: 0/TCP
Command:
/metrics-server
--metric-resolution=30s
--kubelet-port=10255
--deprecated-kubelet-completely-insecure=true
--kubelet-preferred-address-types=InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP
Limits:
cpu: 48m
memory: 95Mi
Requests:
cpu: 48m
memory: 95Mi
Environment: <none>
Mounts: <none>
metrics-server-nanny:
Image: gke.gcr.io/addon-resizer:1.8.10-gke.0
Port: <none>
Host Port: <none>
Command:
/pod_nanny
--config-dir=/etc/config
--cpu=40m
--extra-cpu=0.5m
--memory=35Mi
--extra-memory=4Mi
--threshold=5
--deployment=metrics-server-v0.3.6
--container=metrics-server
--poll-period=300000
--estimator=exponential
--scale-down-delay=24h
--minClusterSize=5
--use-metrics=true
Limits:
cpu: 100m
memory: 300Mi
Requests:
cpu: 5m
memory: 50Mi
Environment:
MY_POD_NAME: (v1:metadata.name)
MY_POD_NAMESPACE: (v1:metadata.namespace)
Mounts:
/etc/config from metrics-server-config-volume (rw)
Volumes:
metrics-server-config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: metrics-server-config
Optional: false
Priority Class Name: system-cluster-critical
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: metrics-server-v0.3.6-787886f769 (1/1 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 3m10s (x2 over 5m39s) deployment-controller Scaled up replica set metrics-server-v0.3.6-7c9d64c44 to 1
Normal ScalingReplicaSet 2m54s (x2 over 5m23s) deployment-controller Scaled down replica set metrics-server-v0.3.6-787886f769 to 0
Normal ScalingReplicaSet 2m50s (x2 over 4m49s) deployment-controller Scaled up replica set metrics-server-v0.3.6-787886f769 to 1
Normal ScalingReplicaSet 2m33s (x2 over 4m34s) deployment-controller Scaled down replica set metrics-server-v0.3.6-7c9d64c44 to 0
编辑:2021-03-13
这是 api 服务器部署的配置:
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-server-deployment
spec:
replicas: 12
selector:
matchLabels:
app: api-server
template:
metadata:
labels:
app: api-server
spec:
serviceAccountName: api-kubernetes-service-account
nodeSelector:
#<labelname>:value
cloud.google.com/gke-nodepool: api-nodepool
containers:
- name: api-server
image: gcr.io/questions-279902/taskserver:latest
imagePullPolicy: "Always"
ports:
- containerPort: 80
#- containerPort: 443
args:
- --disable_https
- --db_ip_address=127.0.0.1
- --modelserver_address=http://10.128.0.18:8501 # kubectl get service model-service --output yaml
resources:
# You must specify requests for CPU to autoscale
# based on CPU utilization
requests:
cpu: "250m"
- name: cloud-sql-proxy
...
我没有看到分配任何“resources:”字段(例如 cpu、mem 等),这应该是根本原因。
请注意,在 HPA(Horizontal Pod Autoscaler)上设置资源是一项要求,官方对此有解释 Kubernetes documentation
请注意,如果某些 Pod 的容器没有设置相关的资源请求,CPU Pod 的利用率将不会被定义,自动缩放器也不会对该指标采取任何行动。
这肯定会导致消息无法读取目标 Deployment 上的所有指标。
我正在尝试设置 Horizontal Pod Autoscaler 以根据 CPU 使用情况自动扩大和缩小我的 api 服务器 pods。
我的 API 目前有 12 pods 运行,但他们正在使用 ~0% CPU.
kubectl get pods
NAME READY STATUS RESTARTS AGE
api-server-deployment-578f8d8649-4cbtc 2/2 Running 2 12h
api-server-deployment-578f8d8649-8cv77 2/2 Running 2 12h
api-server-deployment-578f8d8649-c8tv2 2/2 Running 1 12h
api-server-deployment-578f8d8649-d8c6r 2/2 Running 2 12h
api-server-deployment-578f8d8649-lvbgn 2/2 Running 1 12h
api-server-deployment-578f8d8649-lzjmj 2/2 Running 2 12h
api-server-deployment-578f8d8649-nztck 2/2 Running 1 12h
api-server-deployment-578f8d8649-q25xb 2/2 Running 2 12h
api-server-deployment-578f8d8649-tx75t 2/2 Running 1 12h
api-server-deployment-578f8d8649-wbzzh 2/2 Running 2 12h
api-server-deployment-578f8d8649-wtddv 2/2 Running 1 12h
api-server-deployment-578f8d8649-x95gq 2/2 Running 2 12h
model-server-deployment-76d466dffc-4g2nd 1/1 Running 0 23h
model-server-deployment-76d466dffc-9pqw5 1/1 Running 0 23h
model-server-deployment-76d466dffc-d29fx 1/1 Running 0 23h
model-server-deployment-76d466dffc-frrgn 1/1 Running 0 23h
model-server-deployment-76d466dffc-sfh45 1/1 Running 0 23h
model-server-deployment-76d466dffc-w2hqj 1/1 Running 0 23h
我的 api_hpa.yaml 看起来像:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
name: api-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api-server-deployment
minReplicas: 4
maxReplicas: 12
targetCPUUtilizationPercentage: 50
现在已经 24 小时了,HPA 仍然没有将我的 pods 缩小到 4,尽管没有看到 CPU 使用。
当我查看 GKE 部署详细信息仪表板时,我看到警告 Unable to read all metrics
这会导致自动缩放器无法缩减我的 pods 吗?
我该如何解决?
据我了解,GKE 会自动运行指标服务器:
kubectl get deployment --namespace=kube-system
NAME READY UP-TO-DATE AVAILABLE AGE
event-exporter-gke 1/1 1 1 18d
kube-dns 2/2 2 2 18d
kube-dns-autoscaler 1/1 1 1 18d
l7-default-backend 1/1 1 1 18d
metrics-server-v0.3.6 1/1 1 1 18d
stackdriver-metadata-agent-cluster-level 1/1 1 1 18d
这是该指标服务器的配置:
Name: metrics-server-v0.3.6
Namespace: kube-system
CreationTimestamp: Sun, 21 Feb 2021 11:20:55 -0800
Labels: addonmanager.kubernetes.io/mode=Reconcile
k8s-app=metrics-server
kubernetes.io/cluster-service=true
version=v0.3.6
Annotations: deployment.kubernetes.io/revision: 14
Selector: k8s-app=metrics-server,version=v0.3.6
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: k8s-app=metrics-server
version=v0.3.6
Annotations: seccomp.security.alpha.kubernetes.io/pod: docker/default
Service Account: metrics-server
Containers:
metrics-server:
Image: k8s.gcr.io/metrics-server-amd64:v0.3.6
Port: 443/TCP
Host Port: 0/TCP
Command:
/metrics-server
--metric-resolution=30s
--kubelet-port=10255
--deprecated-kubelet-completely-insecure=true
--kubelet-preferred-address-types=InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP
Limits:
cpu: 48m
memory: 95Mi
Requests:
cpu: 48m
memory: 95Mi
Environment: <none>
Mounts: <none>
metrics-server-nanny:
Image: gke.gcr.io/addon-resizer:1.8.10-gke.0
Port: <none>
Host Port: <none>
Command:
/pod_nanny
--config-dir=/etc/config
--cpu=40m
--extra-cpu=0.5m
--memory=35Mi
--extra-memory=4Mi
--threshold=5
--deployment=metrics-server-v0.3.6
--container=metrics-server
--poll-period=300000
--estimator=exponential
--scale-down-delay=24h
--minClusterSize=5
--use-metrics=true
Limits:
cpu: 100m
memory: 300Mi
Requests:
cpu: 5m
memory: 50Mi
Environment:
MY_POD_NAME: (v1:metadata.name)
MY_POD_NAMESPACE: (v1:metadata.namespace)
Mounts:
/etc/config from metrics-server-config-volume (rw)
Volumes:
metrics-server-config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: metrics-server-config
Optional: false
Priority Class Name: system-cluster-critical
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: metrics-server-v0.3.6-787886f769 (1/1 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 3m10s (x2 over 5m39s) deployment-controller Scaled up replica set metrics-server-v0.3.6-7c9d64c44 to 1
Normal ScalingReplicaSet 2m54s (x2 over 5m23s) deployment-controller Scaled down replica set metrics-server-v0.3.6-787886f769 to 0
Normal ScalingReplicaSet 2m50s (x2 over 4m49s) deployment-controller Scaled up replica set metrics-server-v0.3.6-787886f769 to 1
Normal ScalingReplicaSet 2m33s (x2 over 4m34s) deployment-controller Scaled down replica set metrics-server-v0.3.6-7c9d64c44 to 0
编辑:2021-03-13
这是 api 服务器部署的配置:
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-server-deployment
spec:
replicas: 12
selector:
matchLabels:
app: api-server
template:
metadata:
labels:
app: api-server
spec:
serviceAccountName: api-kubernetes-service-account
nodeSelector:
#<labelname>:value
cloud.google.com/gke-nodepool: api-nodepool
containers:
- name: api-server
image: gcr.io/questions-279902/taskserver:latest
imagePullPolicy: "Always"
ports:
- containerPort: 80
#- containerPort: 443
args:
- --disable_https
- --db_ip_address=127.0.0.1
- --modelserver_address=http://10.128.0.18:8501 # kubectl get service model-service --output yaml
resources:
# You must specify requests for CPU to autoscale
# based on CPU utilization
requests:
cpu: "250m"
- name: cloud-sql-proxy
...
我没有看到分配任何“resources:”字段(例如 cpu、mem 等),这应该是根本原因。 请注意,在 HPA(Horizontal Pod Autoscaler)上设置资源是一项要求,官方对此有解释 Kubernetes documentation
请注意,如果某些 Pod 的容器没有设置相关的资源请求,CPU Pod 的利用率将不会被定义,自动缩放器也不会对该指标采取任何行动。
这肯定会导致消息无法读取目标 Deployment 上的所有指标。