Using Horizontal Pod Autoscaler on Google Kubernetes Engine fails with: Unable to read all metrics

I am trying to set up a Horizontal Pod Autoscaler to automatically scale my API server pods up and down based on CPU utilization.

My API currently has 12 pods running, but they are using ~0% CPU.

kubectl get pods
NAME                                       READY   STATUS    RESTARTS   AGE
api-server-deployment-578f8d8649-4cbtc     2/2     Running   2          12h
api-server-deployment-578f8d8649-8cv77     2/2     Running   2          12h
api-server-deployment-578f8d8649-c8tv2     2/2     Running   1          12h
api-server-deployment-578f8d8649-d8c6r     2/2     Running   2          12h
api-server-deployment-578f8d8649-lvbgn     2/2     Running   1          12h
api-server-deployment-578f8d8649-lzjmj     2/2     Running   2          12h
api-server-deployment-578f8d8649-nztck     2/2     Running   1          12h
api-server-deployment-578f8d8649-q25xb     2/2     Running   2          12h
api-server-deployment-578f8d8649-tx75t     2/2     Running   1          12h
api-server-deployment-578f8d8649-wbzzh     2/2     Running   2          12h
api-server-deployment-578f8d8649-wtddv     2/2     Running   1          12h
api-server-deployment-578f8d8649-x95gq     2/2     Running   2          12h
model-server-deployment-76d466dffc-4g2nd   1/1     Running   0          23h
model-server-deployment-76d466dffc-9pqw5   1/1     Running   0          23h
model-server-deployment-76d466dffc-d29fx   1/1     Running   0          23h
model-server-deployment-76d466dffc-frrgn   1/1     Running   0          23h
model-server-deployment-76d466dffc-sfh45   1/1     Running   0          23h
model-server-deployment-76d466dffc-w2hqj   1/1     Running   0          23h

My api_hpa.yaml looks like this:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server-deployment
  minReplicas: 4
  maxReplicas: 12
  targetCPUUtilizationPercentage: 50

It has now been 24 hours, and the HPA still has not scaled my pods down to 4, even though the pods show essentially no CPU usage.
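Based on the documented HPA scaling rule, this is the behaviour I would expect if the metrics were actually being read (formula from the Kubernetes HPA docs, with my current values plugged in):

desiredReplicas = ceil[currentReplicas * (currentUtilization / targetUtilization)]
                = ceil[12 * (0 / 50)]
                = 0, which is clamped to minReplicas = 4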

When I look at the GKE deployment details dashboard, I see the warning Unable to read all metrics.

Is this what is preventing the autoscaler from scaling my pods down?

How can I fix this?
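For what it's worth, the HPA's own view of the metrics can be inspected directly; this is roughly what I have been looking at (the name api-hpa matches the manifest above):

kubectl describe hpa api-hpa

The Conditions and Events sections of that output are where a metrics problem would normally show up.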

As I understand it, GKE runs the metrics server automatically:

kubectl get deployment --namespace=kube-system
NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
event-exporter-gke                         1/1     1            1           18d
kube-dns                                   2/2     2            2           18d
kube-dns-autoscaler                        1/1     1            1           18d
l7-default-backend                         1/1     1            1           18d
metrics-server-v0.3.6                      1/1     1            1           18d
stackdriver-metadata-agent-cluster-level   1/1     1            1           18d
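To check whether the metrics API is actually serving data (rather than the deployment merely being up), something like the following can be used; the APIService name assumes the v1beta1 registration that metrics-server v0.3.6 creates:

kubectl get apiservice v1beta1.metrics.k8s.io
kubectl top pods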

Here is the configuration of that metrics server:

Name:                   metrics-server-v0.3.6
Namespace:              kube-system
CreationTimestamp:      Sun, 21 Feb 2021 11:20:55 -0800
Labels:                 addonmanager.kubernetes.io/mode=Reconcile
                        k8s-app=metrics-server
                        kubernetes.io/cluster-service=true
                        version=v0.3.6
Annotations:            deployment.kubernetes.io/revision: 14
Selector:               k8s-app=metrics-server,version=v0.3.6
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           k8s-app=metrics-server
                    version=v0.3.6
  Annotations:      seccomp.security.alpha.kubernetes.io/pod: docker/default
  Service Account:  metrics-server
  Containers:
   metrics-server:
    Image:      k8s.gcr.io/metrics-server-amd64:v0.3.6
    Port:       443/TCP
    Host Port:  0/TCP
    Command:
      /metrics-server
      --metric-resolution=30s
      --kubelet-port=10255
      --deprecated-kubelet-completely-insecure=true
      --kubelet-preferred-address-types=InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP
    Limits:
      cpu:     48m
      memory:  95Mi
    Requests:
      cpu:        48m
      memory:     95Mi
    Environment:  <none>
    Mounts:       <none>
   metrics-server-nanny:
    Image:      gke.gcr.io/addon-resizer:1.8.10-gke.0
    Port:       <none>
    Host Port:  <none>
    Command:
      /pod_nanny
      --config-dir=/etc/config
      --cpu=40m
      --extra-cpu=0.5m
      --memory=35Mi
      --extra-memory=4Mi
      --threshold=5
      --deployment=metrics-server-v0.3.6
      --container=metrics-server
      --poll-period=300000
      --estimator=exponential
      --scale-down-delay=24h
      --minClusterSize=5
      --use-metrics=true
    Limits:
      cpu:     100m
      memory:  300Mi
    Requests:
      cpu:     5m
      memory:  50Mi
    Environment:
      MY_POD_NAME:        (v1:metadata.name)
      MY_POD_NAMESPACE:   (v1:metadata.namespace)
    Mounts:
      /etc/config from metrics-server-config-volume (rw)
  Volumes:
   metrics-server-config-volume:
    Type:               ConfigMap (a volume populated by a ConfigMap)
    Name:               metrics-server-config
    Optional:           false
  Priority Class Name:  system-cluster-critical
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   metrics-server-v0.3.6-787886f769 (1/1 replicas created)
Events:
  Type    Reason             Age                    From                   Message
  ----    ------             ----                   ----                   -------
  Normal  ScalingReplicaSet  3m10s (x2 over 5m39s)  deployment-controller  Scaled up replica set metrics-server-v0.3.6-7c9d64c44 to 1
  Normal  ScalingReplicaSet  2m54s (x2 over 5m23s)  deployment-controller  Scaled down replica set metrics-server-v0.3.6-787886f769 to 0
  Normal  ScalingReplicaSet  2m50s (x2 over 4m49s)  deployment-controller  Scaled up replica set metrics-server-v0.3.6-787886f769 to 1
  Normal  ScalingReplicaSet  2m33s (x2 over 4m34s)  deployment-controller  Scaled down replica set metrics-server-v0.3.6-7c9d64c44 to 0
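If the metrics server is failing to scrape the kubelets, its own logs usually say so; a sketch of how to check, with the container name taken from the description above:

kubectl logs --namespace=kube-system deployment/metrics-server-v0.3.6 -c metrics-server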

Edit: 2021-03-13

Here is the configuration of the api server deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server-deployment
spec:
  replicas: 12
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      serviceAccountName: api-kubernetes-service-account
      nodeSelector:
        #<labelname>:value
        cloud.google.com/gke-nodepool: api-nodepool
      containers:
      - name: api-server
        image: gcr.io/questions-279902/taskserver:latest
        imagePullPolicy: "Always"
        ports: 
        - containerPort: 80
        #- containerPort: 443
        args:
        - --disable_https
        - --db_ip_address=127.0.0.1
        - --modelserver_address=http://10.128.0.18:8501 # kubectl get service model-service --output yaml
        resources:
          # You must specify requests for CPU to autoscale
          # based on CPU utilization
          requests:
            cpu: "250m"
      - name: cloud-sql-proxy
...

I don't see a "resources:" field (e.g. cpu, mem, etc.) assigned on every container, and that should be the root cause. Note that setting resource requests is a requirement for the HPA (Horizontal Pod Autoscaler), as explained in the official Kubernetes documentation:

Please note that if some of the Pod's containers do not have the relevant resource request set, CPU utilization for the Pod will not be defined and the autoscaler will not take any action for that metric.

That would certainly cause the Unable to read all metrics message on the target Deployment.
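A minimal sketch of the fix, assuming the truncated cloud-sql-proxy sidecar is the container without requests (the request values below are placeholders, not recommendations):

      containers:
      - name: api-server
        # ... unchanged; this container already requests cpu: "250m" ...
      - name: cloud-sql-proxy
        # ... image, command, etc. unchanged ...
        resources:
          requests:
            cpu: "100m"     # placeholder; every container needs a CPU request
            memory: "64Mi"  # placeholder

Once every container in the Pod has a CPU request, the HPA can compute the Pod's CPU utilization and the Unable to read all metrics warning should go away.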