Unschedulable Kubernetes pods on GCP using Autoscaler
I have a Kubernetes cluster in which the pods autoscale using Autopilot. They suddenly stopped autoscaling. I'm new to Kubernetes, and I don't know what to do or what to show from the console when asking for help.
The pods become unschedulable on their own; inside the cluster their status is Pending instead of Running, and I can't exec into or interact with them.
I also can't delete or stop them from the GCP console. There shouldn't be any problem with insufficient memory or CPU, since not much is running on the server.
The cluster worked as expected until I ran into this problem.
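Below is the describe output for the stuck pods. For reference, this is roughly how it was collected (a sketch; the pod name is a placeholder for one of your Pending pods):

kubectl get pods -n default
kubectl describe pod <pending-pod-name> -n default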
Namespace: default
Priority: 0
Node: <none>
Labels: app=odoo-service
pod-template-hash=5bd88899d7
Annotations: seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/odoo-cluster-dev-5bd88899d7
Containers:
odoo-service:
Image: us-central1-docker.pkg.dev/adams-dev/adams-odoo/odoo-service:v58
Port: <none>
Host Port: <none>
Limits:
cpu: 2
ephemeral-storage: 1Gi
memory: 8Gi
Requests:
cpu: 2
ephemeral-storage: 1Gi
memory: 8Gi
Environment:
ODOO_HTTP_SOCKET_TIMEOUT: 30
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zqh5r (ro)
cloud-sql-proxy:
Image: gcr.io/cloudsql-docker/gce-proxy:1.17
Port: <none>
Host Port: <none>
Command:
/cloud_sql_proxy
-instances=adams-dev:us-central1:odoo-test=tcp:5432
Limits:
cpu: 1
ephemeral-storage: 1Gi
memory: 2Gi
Requests:
cpu: 1
ephemeral-storage: 1Gi
memory: 2Gi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zqh5r (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
kube-api-access-zqh5r:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NotTriggerScaleUp 28m (x248 over 3h53m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 4 in backoff after failed scale-up, 2 Insufficient cpu, 2 Insufficient memory
Normal NotTriggerScaleUp 8m1s (x261 over 3h55m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 Insufficient memory, 4 in backoff after failed scale-up, 2 Insufficient cpu
Normal NotTriggerScaleUp 3m (x1646 over 3h56m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 Insufficient cpu, 2 Insufficient memory, 4 in backoff after failed scale-up
Warning FailedScheduling 20s (x168 over 3h56m) gke.io/optimize-utilization-scheduler 0/2 nodes are available: 2 Insufficient cpu, 2 Insufficient memory.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NotTriggerScaleUp 28m (x250 over 3h56m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 Insufficient memory, 4 in backoff after failed scale-up, 2 Insufficient cpu
Normal NotTriggerScaleUp 8m2s (x300 over 3h55m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 4 in backoff after failed scale-up, 2 Insufficient cpu, 2 Insufficient memory
Warning FailedScheduling 5m21s (x164 over 3h56m) gke.io/optimize-utilization-scheduler 0/2 nodes are available: 2 Insufficient cpu, 2 Insufficient memory.
Normal NotTriggerScaleUp 3m1s (x1616 over 3h55m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 Insufficient cpu, 2 Insufficient memory, 4 in backoff after failed scale-up
I'm not sure how much of this I can debug or fix myself.
The pods cannot be scheduled on any node because none of the nodes has enough CPU available.
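For scale: this pod requests 3 vCPU and 10 GiB of memory in total (2 CPU / 8Gi for odoo-service plus 1 CPU / 2Gi for cloud-sql-proxy), and because the QoS class is Guaranteed the limits equal the requests, so nothing smaller will fit it. You can compare that against what each node can actually hand out (standard kubectl commands; kubectl top needs the metrics server, which GKE ships by default):

kubectl describe nodes | grep -A 7 Allocatable
kubectl top nodes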
The cluster autoscaler tried to scale up, but it is in backoff after failed scale-up attempts, which suggests a problem with scaling the managed instance groups that back the node pool.
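To see why the scale-up attempts keep failing, you can look at recent cluster events; on a Standard cluster the autoscaler also publishes a status ConfigMap (a sketch; that ConfigMap is typically not readable on Autopilot, where kube-system access is locked down):

kubectl get events -A --sort-by=.lastTimestamp | grep -i scale
kubectl describe configmap cluster-autoscaler-status -n kube-system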
The cluster autoscaler tried to scale up but could not add new nodes because a quota limit has been reached.
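You can check this against the Compute Engine quotas for the cluster's region (taking the project and region from the pod spec above; the grep just pulls the CPU entry out of the YAML output):

gcloud compute regions describe us-central1 --project adams-dev | grep -B 1 -A 1 "metric: CPUS"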
You can't see the Autopilot GKE VMs, but they still count against your quota.
Try creating the Autopilot cluster in a different region. If an Autopilot cluster no longer meets your needs, switch to a Standard cluster.
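If you do try another region or fall back to Standard, the corresponding gcloud commands look roughly like this (a sketch; the cluster names and the alternate region are placeholders, not values from the question):

gcloud container clusters create-auto odoo-cluster-dev-2 --region us-east1 --project adams-dev   # Autopilot in another region
gcloud container clusters create odoo-cluster-standard --zone us-central1-a --project adams-dev  # Standard cluster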