Why does my Kubernetes Cronjob pod get killed while executing?
Kubernetes version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-13T02:40:46Z", GoVersion:"go1.16.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"e1d093448d0ed9b9b1a48f49833ff1ee64c05ba5", GitTreeState:"clean", BuildDate:"2021-06-03T00:20:57Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}
I have a Kubernetes cronjob whose purpose is to run some Azure CLI commands on a time-based schedule.
Running the container locally works fine, however, manually triggering the Cronjob through Lens, or letting it run on its schedule, results in strange behaviour (running it in the cloud as a job produces unexpected results).
Here is the cronjob definition:
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: development-scale-down
  namespace: development
spec:
  schedule: "0 22 * * 0-4"
  concurrencyPolicy: Allow
  startingDeadlineSeconds: 60
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      backoffLimit: 0 # Do not retry
      activeDeadlineSeconds: 360 # 6 minutes
      template:
        spec:
          containers:
            - name: scaler
              image: myimage:latest
              imagePullPolicy: Always
              env: ...
          restartPolicy: "Never"
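For reference, a manual trigger like the one Lens performs is roughly equivalent to creating a Job from the CronJob by hand; the job name below is only illustrative, not the one Lens generated:
$ kubectl create job development-scale-down-manual --from=cronjob/development-scale-down -n development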
I ran the cronjob manually and it created the job development-scale-down-manual-xwp1k. Describing this job after it finished, we can see the following:
$ kubectl describe job development-scale-down-manual-xwp1k
Name: development-scale-down-manual-xwp1k
Namespace: development
Selector: controller-uid=ecf8fb47-cd50-42eb-9a6f-888f7e2c9257
Labels: controller-uid=ecf8fb47-cd50-42eb-9a6f-888f7e2c9257
job-name=development-scale-down-manual-xwp1k
Annotations: <none>
Parallelism: 1
Completions: 1
Start Time: Wed, 04 Aug 2021 09:40:28 +1200
Active Deadline Seconds: 360s
Pods Statuses: 0 Running / 0 Succeeded / 1 Failed
Pod Template:
Labels: controller-uid=ecf8fb47-cd50-42eb-9a6f-888f7e2c9257
job-name=development-scale-down-manual-xwp1k
Containers:
scaler:
Image: myimage:latest
Port: <none>
Host Port: <none>
Environment:
CLUSTER_NAME: ...
NODEPOOL_NAME: ...
NODEPOOL_SIZE: ...
RESOURCE_GROUP: ...
SP_APP_ID: <set to the key 'application_id' in secret 'scaler-secrets'> Optional: false
SP_PASSWORD: <set to the key 'application_pass' in secret 'scaler-secrets'> Optional: false
SP_TENANT: <set to the key 'application_tenant' in secret 'scaler-secrets'> Optional: false
Mounts: <none>
Volumes: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 24m job-controller Created pod: development-scale-down-manual-xwp1k-b858c
Normal SuccessfulCreate 23m job-controller Created pod: development-scale-down-manual-xwp1k-xkkw9
Warning BackoffLimitExceeded 23m job-controller Job has reached the specified backoff limit
This is different from , which makes no mention of a "SuccessfulDelete" event.
The events received from kubectl get events tell a more interesting story:
$ ktl get events | grep xwp1k
3m19s Normal Scheduled pod/development-scale-down-manual-xwp1k-b858c Successfully assigned development/development-scale-down-manual-xwp1k-b858c to aks-burst-37275452-vmss00000d
3m18s Normal Pulling pod/development-scale-down-manual-xwp1k-b858c Pulling image "myimage:latest"
2m38s Normal Pulled pod/development-scale-down-manual-xwp1k-b858c Successfully pulled image "myimage:latest" in 40.365655229s
2m23s Normal Created pod/development-scale-down-manual-xwp1k-b858c Created container myimage
2m23s Normal Started pod/development-scale-down-manual-xwp1k-b858c Started container myimage
2m12s Normal Killing pod/development-scale-down-manual-xwp1k-b858c Stopping container myimage
2m12s Normal Scheduled pod/development-scale-down-manual-xwp1k-xkkw9 Successfully assigned development/development-scale-down-manual-xwp1k-xkkw9 to aks-default-37275452-vmss000002
2m12s Normal Pulling pod/development-scale-down-manual-xwp1k-xkkw9 Pulling image "myimage:latest"
2m11s Normal Pulled pod/development-scale-down-manual-xwp1k-xkkw9 Successfully pulled image "myimage:latest" in 751.93652ms
2m10s Normal Created pod/development-scale-down-manual-xwp1k-xkkw9 Created container myimage
2m10s Normal Started pod/development-scale-down-manual-xwp1k-xkkw9 Started container myimage
3m19s Normal SuccessfulCreate job/development-scale-down-manual-xwp1k Created pod: development-scale-down-manual-xwp1k-b858c
2m12s Normal SuccessfulCreate job/development-scale-down-manual-xwp1k Created pod: development-scale-down-manual-xwp1k-xkkw9
2m1s Warning BackoffLimitExceeded job/development-scale-down-manual-xwp1k Job has reached the specified backoff limit
I cannot figure out why the container was killed; the logs all look fine and there are no resource constraints. The container is removed very quickly, which means I have very little time to debug. The more verbose event line reads:
3m54s Normal Killing pod/development-scale-down-manual-xwp1k-b858c spec.containers{myimage} kubelet, aks-burst-37275452-vmss00000d Stopping container myimage 3m54s 1 development-scale-down-manual-xwp1k-b858c.1697e9d5e5b846ef
I noticed the image pull initially takes a few seconds (40), could this contribute to exceeding the startingDeadline or another part of the cron spec?
Any ideas or help are appreciated, thank you.
Reading the logs! It always helps.
Context
For context, the job itself scales an AKS node pool. We have two: the default system pool, and a new user-controlled one. The cronjob is meant to scale the new user pool (not the system pool).
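For the curious, the two pools can be listed with the Azure CLI; the resource group and cluster name below are placeholders:
$ az aks nodepool list --resource-group <resource-group> --cluster-name <cluster-name> -o table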
Investigation
I noticed that the scale-down job always takes longer than the scale-up job; this is because the image pull always happens when the scale-down job runs.
I also noticed that the Killing event mentioned above originates from the kubelet (kubectl get events -o wide).
I went to check the kubelet logs on the host and noticed the hostname was a bit atypical (aks-burst-XXXXXXXX-vmss00000d), since most hosts in our small development cluster usually end with a number, not a d.
Then I realised the naming was different because this node was not part of the default node pool, and I could not check the kubelet logs because the host had already been deleted.
Cause
The job scales compute resources down. The scale-down would always fail because it is always preceded by the scale-up, at which point there is a brand new node in the cluster. Nothing is running on that node yet, so the next job gets scheduled onto it. The job starts on the new node, tells Azure to scale the new node down to 0, and the kubelet then kills the job while it is still running.
Always being scheduled onto the new node also explains why the image pull happened every time.
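The scaler script itself isn't shown here, but it does something along these lines with the Azure CLI, using the environment variables from the job spec; az aks nodepool scale is the call that removes the very node the pod is running on (this is a sketch, not the actual script):
$ az login --service-principal -u "$SP_APP_ID" -p "$SP_PASSWORD" --tenant "$SP_TENANT"
$ az aks nodepool scale --resource-group "$RESOURCE_GROUP" --cluster-name "$CLUSTER_NAME" --name "$NODEPOOL_NAME" --node-count "$NODEPOOL_SIZE"   # 0 for the scale-down job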
Fix
I changed the spec and added a nodeSelector so the job always runs on the system pool, which is more stable than the user pool:
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: development-scale-down
  namespace: development
spec:
  schedule: "0 22 * * 0-4"
  concurrencyPolicy: Allow
  startingDeadlineSeconds: 60
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      backoffLimit: 0 # Do not retry
      activeDeadlineSeconds: 360 # 6 minutes
      template:
        spec:
          containers:
            - name: scaler
              image: myimage:latest
              imagePullPolicy: Always
              env: ...
          restartPolicy: "Never"
          nodeSelector:
            agentpool: default
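To double-check the label value used in the nodeSelector, the agentpool label can be inspected directly on the nodes, for example:
$ kubectl get nodes -L agentpool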