k8s pod stuck in status "pending"
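All new pods stay in the "Pending" state. This does not look like a resource problem, because overall cluster utilization is about 10% CPU and 30% memory.
How can I dig deeper into the problem?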
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
cq-iam-boarding-77fd94dc94-8pc6f 1/1 Running 0 30h
cq-iam-demo-cloud-6b99f6544d-9v7j7 1/1 Running 0 30h
cq-iam-mpm-dev-8c6cc58fd-fczlw 1/1 Running 0 30h
cq-iam-proxy-86854cc78d-49gfw 0/1 Terminating 0 7h42m
cq-iam-proxy-86854cc78d-dqlz8 0/1 Terminating 0 7h36m
cq-iam-proxy-86854cc78d-m7zs2 0/1 Pending 0 5h22m
cq-launchpad-app-7b57c478b9-gqcxj 1/1 Running 0 13h
cq-management-api-7c689c7846-q9fz2 1/1 Running 0 29h
cq-opa-api-8458db697c-75rzd 1/1 Running 0 30h
cq-settings-app-6874885794-mspj9 1/1 Running 0 29h
node-debugger-aks-nodepool1-31127038-vmss000000-czt8s 0/1 Pending 0 8h
$ kubectl top pods
NAME CPU(cores) MEMORY(bytes)
cq-iam-boarding-77fd94dc94-8pc6f 2m 482Mi
cq-iam-demo-cloud-6b99f6544d-9v7j7 2m 507Mi
cq-iam-mpm-dev-8c6cc58fd-fczlw 2m 443Mi
cq-launchpad-app-7b57c478b9-gqcxj 0m 2Mi
cq-management-api-7c689c7846-q9fz2 1m 88Mi
cq-opa-api-8458db697c-75rzd 1m 17Mi
cq-settings-app-6874885794-mspj9 1m 2Mi
$ kubectl describe pod cq-iam-proxy-86854cc78d-m7zs2
Name: cq-iam-proxy-86854cc78d-m7zs2
Namespace: dev
Priority: 0
Node: aks-nodepool1-31127038-vmss000000/
Labels: app=cq-iam-proxy
pod-template-hash=86854cc78d
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/cq-iam-proxy-86854cc78d
Containers:
cq-iam-proxy:
Image: xxx.azurecr.io/karneval/cq-iam-proxy:1.0.14
Port: 80/TCP
Host Port: 0/TCP
Environment:
CQ_HOSTNAME: dev.hvt.zone
key1: TODO
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-pl6p4 (ro)
Conditions:
Type Status
PodScheduled True
Volumes:
default-token-pl6p4:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-pl6p4
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
Checking the status of nodepool1:
- the node pool is healthy
- all three nodes are green (memory, disk, ready)
Can you show the pod's logs?
This is what I get when I print the pod's logs:
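The node status reported above can also be cross-checked from the command line. This is only a minimal sketch: the helper name `unhealthy_conditions` is hypothetical, and it assumes every condition other than `Ready` is healthy when it is `False`, which holds for the built-in kubelet conditions but may not hold for custom ones. In this thread the nodes reported healthy, so a clean result here would not rule out a node-level problem.

```shell
# Print node conditions that look unhealthy.
# Input: one "TYPE=STATUS" pair per line on stdin.
# Assumption: Ready should be True, every other condition should be False.
unhealthy_conditions() {
  awk -F= 'NF && (($1 == "Ready" && $2 != "True") || ($1 != "Ready" && $2 != "False")) { print $1 "=" $2 }'
}

# Assumed usage against a live cluster (node name taken from the output above):
# kubectl get node aks-nodepool1-31127038-vmss000000 \
#   -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#   | unhealthy_conditions
```

An empty result means every condition matched the expected value.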
$ kubectl logs cq-iam-proxy-86854cc78d-m7zs2
Error from server (NotFound): the server could not find the requested resource ( pods/log cq-iam-proxy-86854cc78d-m7zs2)
Please include the events of the pods in the Terminating state. There may be a clue there:
$ kubectl describe pod cq-iam-proxy-86854cc78d-49gfw
Name: cq-iam-proxy-86854cc78d-49gfw
Namespace: dev
Priority: 0
Node: aks-nodepool1-31127038-vmss000000/
Labels: app=cq-iam-proxy
pod-template-hash=86854cc78d
Annotations: <none>
Status: Terminating (lasts 2d18h)
Termination Grace Period: 30s
IP:
IPs: <none>
Controlled By: ReplicaSet/cq-iam-proxy-86854cc78d
Containers:
cq-iam-proxy:
Image: xxx.azurecr.io/karneval/cq-iam-proxy:1.0.14
Port: 80/TCP
Host Port: 0/TCP
Environment:
CQ_HOSTNAME: dev.hvt.zone
key1: TODO
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-pl6p4 (ro)
Conditions:
Type Status
PodScheduled True
Volumes:
default-token-pl6p4:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-pl6p4
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
No events there? Is there anything in the logs of those two pods?
$ kubectl logs cq-iam-proxy-86854cc78d-dqlz8
Error from server (NotFound): the server could not find the requested resource ( pods/log cq-iam-proxy-86854cc78d-dqlz8)
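This looks like a problem with the application itself.
It does not look like a problem with the application itself. I ran these two commands: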
$ kubectl run --image=busybox myapp -- false
$ kubectl run --image=busybox myapp2 -- false
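myapp was able to start (it ends up in CrashLoopBackOff only because `false` exits with a non-zero status). myapp2 is stuck in Pending, like the other applications: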
myapp 0/1 CrashLoopBackOff 5 11m
myapp2 0/1 Pending 0 9m26s
$ kubectl describe pod myapp
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 11m default-scheduler Successfully assigned dev/myapp to aks-nodepool1-31127038-vmss000001
Normal Created 11m (x4 over 11m) kubelet Created container myapp
Normal Started 11m (x4 over 11m) kubelet Started container myapp
Normal Pulling 10m (x5 over 11m) kubelet Pulling image "busybox"
Normal Pulled 10m (x5 over 11m) kubelet Successfully pulled image "busybox"
Warning BackOff 95s (x47 over 11m) kubelet Back-off restarting failed container
$ kubectl describe pod myapp2
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 10m default-scheduler Successfully assigned dev/myapp2 to aks-nodepool1-31127038-vmss000000
The only difference between myapp and myapp2 is that they were scheduled on different nodes:
- myapp started successfully on node aks-nodepool1-31127038-vmss000001
- myapp2 did not start on node aks-nodepool1-31127038-vmss000000
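Since the stuck pods all point at the same node, grouping Pending pods by their assigned node can make this pattern visible at a glance. The following is a rough sketch, not a definitive diagnostic: `pending_by_node` is a hypothetical helper, and it assumes its input has three whitespace-separated columns (pod name, phase, node). Pods shown as Terminating by `kubectl get pod` keep their underlying phase, so ones that never started are counted as Pending here too.

```shell
# Count Pending pods per node.
# Input: "NAME PHASE NODE" rows on stdin, one pod per line.
pending_by_node() {
  awk '$2 == "Pending" { count[$3]++ } END { for (n in count) print n, count[n] }' | sort
}

# Assumed usage against a live cluster (namespace "dev" taken from the output above):
# kubectl get pods -n dev --no-headers \
#   -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,NODE:.spec.nodeName \
#   | pending_by_node
```

If one node collects nearly all of the Pending pods while the others have none, the node is a more likely culprit than the workloads.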
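Two weeks later the cluster healed itself.
Node aks-nodepool1-31127038-vmss000000 had a problem: it got stuck when starting containers.
The next time I run into this, I will try these commands to recover the node: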
kubectl cordon my-node # Mark my-node as unschedulable
kubectl drain my-node # Drain my-node in preparation for maintenance
kubectl uncordon my-node # Mark my-node as schedulable
kubectl top node my-node # Show metrics for a given node
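The commands above could be chained into one helper. This is a sketch under assumptions, not a tested fix for this incident: `recycle_node` and the `KUBECTL` override variable are hypothetical names, and the drain flags are my choice rather than something from the thread. `--delete-emptydir-data` is named `--delete-local-data` on older kubectl versions, and `--timeout` is included because pods stuck in Terminating might otherwise keep the drain waiting.

```shell
# Allow the kubectl binary to be overridden (for example for a dry run).
KUBECTL="${KUBECTL:-kubectl}"

# Cordon, drain, then uncordon a node. Stops at the first failing step,
# which leaves the node cordoned so it can be inspected.
recycle_node() {
  node="$1"
  [ -n "$node" ] || { echo "usage: recycle_node NODE" >&2; return 1; }
  "$KUBECTL" cordon "$node" || return 1
  "$KUBECTL" drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=120s || return 1
  # Any node-level repair (restart, reimage) would go here before uncordoning.
  "$KUBECTL" uncordon "$node"
}

# Assumed usage:
# recycle_node aks-nodepool1-31127038-vmss000000
```

Setting `KUBECTL=echo` before calling the function prints the commands instead of running them.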