EKS Pods being terminated for no reason
I wonder if anyone can help me.
Kubernetes (K8s 1.21, platform version eks.4) is terminating running pods with no error or reason. The only thing I can see in the events is:
7m47s Normal Killing pod/test-job-6c9fn-qbzkb Stopping container test-job
Because I have an anti-affinity rule in place, only one of these pods can run per node. So every time a pod gets killed, the autoscaler launches another node.
These are the cluster-autoscaler logs:
I0208 19:10:42.336476 1 cluster.go:148] Fast evaluation: ip-10-4-127-38.us-west-2.compute.internal for removal
I0208 19:10:42.336484 1 cluster.go:169] Fast evaluation: node ip-10-4-127-38.us-west-2.compute.internal cannot be removed: pod annotated as not safe to evict present: test-job-6c9fn-qbzkb
I0208 19:10:42.336493 1 scale_down.go:612] 1 nodes found to be unremovable in simulation, will re-check them at 2022-02-08 19:15:42.335305238 +0000 UTC m=+20363.008486077
I0208 19:15:04.360683 1 klogx.go:86] Pod default/test-job-6c9fn-8wx2q is unschedulable
I0208 19:15:04.360719 1 scale_up.go:376] Upcoming 0 nodes
I0208 19:15:04.360861 1 scale_up.go:300] Pod test-job-6c9fn-8wx2q can't be scheduled on eks-ec2-8xlarge-84bf6ad9-ca4a-4293-a3e8-95bef28db16d, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
I0208 19:15:04.360901 1 scale_up.go:449] No pod can fit to eks-ec2-8xlarge-84bf6ad9-ca4a-4293-a3e8-95bef28db16d
I0208 19:15:04.361035 1 scale_up.go:300] Pod test-job-6c9fn-8wx2q can't be scheduled on eks-ec2-inf1-90bf6ad9-caf7-74e8-c930-b80f785bc743, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
I0208 19:15:04.361062 1 scale_up.go:449] No pod can fit to eks-ec2-inf1-90bf6ad9-caf7-74e8-c930-b80f785bc743
I0208 19:15:04.361162 1 scale_up.go:300] Pod test-job-6c9fn-8wx2q can't be scheduled on eks-ec2-large-62bf6ad9-ccd4-6e03-5c78-c3366d387d50, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
I0208 19:15:04.361194 1 scale_up.go:449] No pod can fit to eks-ec2-large-62bf6ad9-ccd4-6e03-5c78-c3366d387d50
I0208 19:15:04.361512 1 scale_up.go:412] Skipping node group eks-eks-on-demand-10bf6ad9-c978-9b35-c7fc-cdb9977b27cb - max size reached
I0208 19:15:04.361675 1 scale_up.go:300] Pod test-job-6c9fn-8wx2q can't be scheduled on eks-ec2-test-58bf6d43-13e8-9acc-5173-b8c5054a56da, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
I0208 19:15:04.361711 1 scale_up.go:449] No pod can fit to eks-ec2-test-58bf6d43-13e8-9acc-5173-b8c5054a56da
I0208 19:15:04.361723 1 waste.go:57] Expanding Node Group eks-ec2-xlarge-84bf6ad9-cb6d-7e24-7eb5-a00c369fd82f would waste 75.00% CPU, 86.92% Memory, 80.96% Blended
I0208 19:15:04.361747 1 scale_up.go:468] Best option to resize: eks-ec2-xlarge-84bf6ad9-cb6d-7e24-7eb5-a00c369fd82f
I0208 19:15:04.361762 1 scale_up.go:472] Estimated 1 nodes needed in eks-ec2-xlarge-84bf6ad9-cb6d-7e24-7eb5-a00c369fd82f
I0208 19:15:04.361780 1 scale_up.go:586] Final scale-up plan: [{eks-ec2-xlarge-84bf6ad9-cb6d-7e24-7eb5-a00c369fd82f 0->1 (max: 2)}]
I0208 19:15:04.361801 1 scale_up.go:675] Scale-up: setting group eks-ec2-xlarge-84bf6ad9-cb6d-7e24-7eb5-a00c369fd82f size to 1
I0208 19:15:04.361826 1 auto_scaling_groups.go:219] Setting asg eks-ec2-xlarge-84bf6ad9-cb6d-7e24-7eb5-a00c369fd82f size to 1
I0208 19:15:04.362154 1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"81b80048-920c-4bf1-b2c0-ad5d067d74f4", APIVersion:"v1", ResourceVersion:"359476", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group eks-ec2-xlarge-84bf6ad9-cb6d-7e24-7eb5-a00c369fd82f size to 1
I0208 19:15:04.374021 1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"81b80048-920c-4bf1-b2c0-ad5d067d74f4", APIVersion:"v1", ResourceVersion:"359476", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group eks-ec2-xlarge-84bf6ad9-cb6d-7e24-7eb5-a00c369fd82f size to 1
I0208 19:15:04.541658 1 eventing_scale_up_processor.go:47] Skipping event processing for unschedulable pods since there is a ScaleUp attempt this loop
I0208 19:15:04.541859 1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"test-job-6c9fn-8wx2q", UID:"67beba1d-4f52-4860-91af-89e5852e4cad", APIVersion:"v1", ResourceVersion:"359507", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{eks-ec2-xlarge-84bf6ad9-cb6d-7e24-7eb5-a00c369fd82f 0->1 (max: 2)}]
I am running an EKS cluster with cluster-autoscaler and KEDA's aws-sqs trigger. I have set up an autoscaling node group that uses Spot instances.
For testing purposes I defined a ScaledJob consisting of a single container running a simple Python script that loops over time.sleep. The pod should run for 30 minutes, but it never gets that far; it usually ends after about 15 minutes. A sketch of such a test script is shown below, followed by the ScaledJob definition (built as a Python dict, where id, group_size, envs, requests and the other bare names are variables filled in elsewhere).
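A minimal sketch of what such a test script could look like (illustrative only; the actual script is not shown in the question):

import time

# Sleep in one-minute increments for 30 minutes, printing progress so the
# container keeps a visible heartbeat in the logs.
for minute in range(30):
    print(f"still working, minute {minute + 1}/30", flush=True)
    time.sleep(60)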
{
"apiVersion": "keda.sh/v1alpha1",
"kind": "ScaledJob",
"metadata": {
"name": id,
"labels": {"myjobidentifier": id},
"annotations": {
"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
},
},
"spec": {
"jobTargetRef": {
"parallelism": 1,
"completions": 1,
"backoffLimit": 0,
"template": {
"metadata": {
"labels": {"job-type": id},
"annotations": {
"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
},
},
"spec": {
"affinity": {
"nodeAffinity": {
"requiredDuringSchedulingIgnoredDuringExecution": {
"nodeSelectorTerms": [
{
"matchExpressions": [
{
"key": "eks.amazonaws.com/nodegroup",
"operator": "In",
"values": group_size,
}
]
}
]
}
},
"podAntiAffinity": {
"requiredDuringSchedulingIgnoredDuringExecution": [
{
"labelSelector": {
"matchExpressions": [
{
"key": "job-type",
"operator": "In",
"values": [id],
}
]
},
"topologyKey": "kubernetes.io/hostname",
}
]
},
},
"serviceAccountName": service_account.service_account_name,
"containers": [
{
"name": id,
"image": image.image_uri,
"imagePullPolicy": "IfNotPresent",
"env": envs,
"resources": {
"requests": requests,
},
"volumeMounts": [
{
"mountPath": "/tmp",
"name": "tmp-volume",
}
],
}
],
"volumes": [
{"name": "tmp-volume", "emptyDir": {}}
],
"restartPolicy": "Never",
},
},
},
"pollingInterval": 30,
"successfulJobsHistoryLimit": 10,
"failedJobsHistoryLimit": 100,
"maxReplicaCount": 30,
"rolloutStrategy": "default",
"scalingStrategy": {"strategy": "default"},
"triggers": [
{
"type": "aws-sqs-queue",
"metadata": {
"queueURL": queue.queue_url,
"queueLength": "1",
"awsRegion": region,
"identityOwner": "operator",
},
}
],
},
}
I know this is not a resource issue (dummy code and large instances), and it is not an eviction issue (the logs clearly show the pod is not being evicted), but I really don't know how else to troubleshoot this.
Thanks a lot!!
Edit:
Same behavior with on-demand and Spot instances.
Edit 2:
I added the aws-node-termination-handler and now I seem to see additional events:
ip-10-4-126-234.us-west-2.compute.internal.16d223107de38c5f
NodeNotSchedulable
Node ip-10-4-126-234.us-west-2.compute.internal status is now: NodeNotSchedulable
test-job-p85f2-txflr.16d2230ea91217a9
FailedScheduling
0/2 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) were unschedulable.
If I check the scaling group activity:
Instance i-03d27a1cf341405e1 was taken out of service in response to a user request, shrinking the capacity from 1 to 0.
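For reference, the scaling activity shown above can also be pulled programmatically. A minimal sketch with boto3, assuming the ASG name from the logs above (substitute your own group and region):

import boto3

# ASG name taken from the autoscaler logs above, for illustration only.
ASG_NAME = "eks-ec2-xlarge-84bf6ad9-cb6d-7e24-7eb5-a00c369fd82f"

autoscaling = boto3.client("autoscaling", region_name="us-west-2")

# List the most recent scaling activities to see what shrank the group and why.
resp = autoscaling.describe_scaling_activities(
    AutoScalingGroupName=ASG_NAME,
    MaxRecords=10,
)
for activity in resp["Activities"]:
    print(activity["StartTime"], activity["Description"], "-", activity["Cause"])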
I also ran into a similar issue because of HPA scale-in.
When you don't set a minReplicaCount value, it defaults to 0; the pods are then terminated by the HPA scale-in.
I suggest you set the minReplicaCount you need (for example, 1).
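As an illustration only (assuming the KEDA version in use exposes minReplicaCount on this resource), the value would sit next to maxReplicaCount in the spec dict above:

scaled_job_spec = {
    # ... the other fields from the spec above ...
    "minReplicaCount": 1,  # keep at least one replica so scale-in does not kill the pod (verify against your KEDA version)
    "maxReplicaCount": 30,
}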
Well, this turned out to be an annoying, small, tricky thing.
There was another EKS cluster in the account, and in that cluster the cluster-autoscaler was started like this:
command:
- ./cluster-autoscaler
- --v=4
- --stderrthreshold=info
- --cloud-provider=aws
- --skip-nodes-with-local-storage=false
- --expander=least-waste
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled
That cluster-autoscaler was discovering all the nodes of the other cluster carrying that tag and killing them after its timeout: 15 minutes.
So the lesson here is that every cluster-autoscaler must be started like this:
command:
- ./cluster-autoscaler
- --v=4
- --stderrthreshold=info
- --cloud-provider=aws
- --skip-nodes-with-local-storage=false
- --expander=least-waste
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/clusterName
And all the scaling groups need to be tagged accordingly.
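A minimal sketch of tagging an Auto Scaling Group with both auto-discovery tags using boto3 (the ASG name is taken from the logs above for illustration, and my-cluster is a hypothetical cluster name; substitute your own values):

import boto3

# Hypothetical names; replace with your own ASG, cluster and region.
ASG_NAME = "eks-ec2-xlarge-84bf6ad9-cb6d-7e24-7eb5-a00c369fd82f"
CLUSTER_NAME = "my-cluster"

autoscaling = boto3.client("autoscaling", region_name="us-west-2")

# Add both tags so only the cluster-autoscaler started with the two-tag
# --node-group-auto-discovery filter above will manage this group.
autoscaling.create_or_update_tags(
    Tags=[
        {
            "ResourceId": ASG_NAME,
            "ResourceType": "auto-scaling-group",
            "Key": "k8s.io/cluster-autoscaler/enabled",
            "Value": "true",
            "PropagateAtLaunch": True,
        },
        {
            "ResourceId": ASG_NAME,
            "ResourceType": "auto-scaling-group",
            "Key": f"k8s.io/cluster-autoscaler/{CLUSTER_NAME}",
            "Value": "owned",
            "PropagateAtLaunch": True,
        },
    ]
)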