SCDF: Error handling when pod failed to start
I'm working on a service that calls Spring Cloud Data Flow (SCDF) to spin up a new Kubernetes pod running a Spring Batch job.
Map<String, String> properties = Map.of("testApp.cpu", cpu, "testApp.memory", memory);
LOGGER.info("Create task '{}' with definition '{}'", taskName, taskDefinition);
taskOperations.create(taskName, taskDefinition);
LOGGER.info("Launching task '{}' with properties {} and arguments '{}'", taskName, properties, args);
return taskOperations.launch(taskName, properties, args);
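For context, taskOperations here is the SCDF Java REST client. A minimal sketch of how such a client could be wired up is below; the server URL, task name, and property values are placeholders, not our real configuration, and the launch return type differs between SCDF client versions:

import java.net.URI;
import java.util.List;
import java.util.Map;

import org.springframework.cloud.dataflow.rest.client.DataFlowTemplate;
import org.springframework.cloud.dataflow.rest.client.TaskOperations;

public class TaskLauncherExample {

    public static void main(String[] args) {
        // Hypothetical SCDF server URL; replace with your own.
        DataFlowTemplate dataFlow = new DataFlowTemplate(URI.create("http://scdf-server:9393"));
        TaskOperations taskOperations = dataFlow.taskOperations();

        // Illustrative values mirroring the snippet above.
        String taskName = "testApp-run-1";
        String taskDefinition = "testApp";
        Map<String, String> properties = Map.of("testApp.cpu", "2", "testApp.memory", "8Gi");
        List<String> arguments = List.of("--fileId=XXXXXXXXXXX", "--fileName=file_name.csv");

        taskOperations.create(taskName, taskDefinition);
        // In the client version we use, launch returns the task execution id.
        long executionId = taskOperations.launch(taskName, properties, arguments);
        System.out.println("Launched task execution " + executionId);
    }
}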
Everything works fine. The problem is that whenever we pull an image that does not exist (for example, because of some connectivity issue), the pod fails to start and we end up with pending tasks (no batch job is ever created).
For example, we end up with a task in the task_execution table (an SCDF table) whose end time is null, but there is no related job in the batch_job_execution table.
At first this looks harmless: no pod was created, so we consume no resources. But once the number of these "pending jobs" reaches 20, we hit the well-known error:
Cannot launch task testApp. The maximum concurrent task executions is at its limit [20]
I'm trying to find a way to detect that the pod spin-up failed (so that we can mark the task execution as failed), but so far without success.
Is there a way to detect that a task launch failed when it spins up a new Kubernetes pod?
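The closest workaround I can think of (sketched below under assumptions; it is not an SCDF feature) is to poll the Kubernetes API directly after launching: the Kubernetes deployer labels the pod with task-name (visible in the describe output in the update below), so a watcher could flag the execution as failed when the container is stuck in ErrImagePull / ImagePullBackOff. This sketch assumes the fabric8 kubernetes-client and a namespace passed in by the caller:

import java.time.Duration;
import java.time.Instant;
import java.util.List;

import io.fabric8.kubernetes.api.model.ContainerStatus;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class TaskPodWatcher {

    // Waiting reasons that indicate the pod will never start with the current image.
    private static final List<String> FATAL_WAIT_REASONS =
            List.of("ErrImagePull", "ImagePullBackOff", "InvalidImageName");

    // Polls the pod launched for the given task and returns false if it gets
    // stuck in a fatal waiting state (or never reaches Running) before the timeout.
    public boolean waitForPodStart(String taskName, String namespace, Duration timeout) {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            Instant deadline = Instant.now().plus(timeout);
            while (Instant.now().isBefore(deadline)) {
                List<Pod> pods = client.pods()
                        .inNamespace(namespace)
                        .withLabel("task-name", taskName) // label set by the Kubernetes deployer
                        .list()
                        .getItems();
                for (Pod pod : pods) {
                    if ("Running".equals(pod.getStatus().getPhase())) {
                        return true;
                    }
                    for (ContainerStatus status : pod.getStatus().getContainerStatuses()) {
                        if (status.getState().getWaiting() != null
                                && FATAL_WAIT_REASONS.contains(status.getState().getWaiting().getReason())) {
                            return false; // caller can then mark the SCDF task execution as failed
                        }
                    }
                }
                Thread.sleep(5_000);
            }
            return false; // timed out without reaching Running
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}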
Update
Not sure whether it is relevant, but we are on SCDF 1.7.3.RELEASE.
kubectl describe output of the failed pod:
Name:                 podname-lp2nyowgmm
Namespace:            my-namespace
Priority:             1000
Priority Class Name:  test-cluster-default
Node:                 some-ip.compute.internal/XX.XXX.XXX.XX
Start Time:           Thu, 14 Jan 2021 18:47:52 +0700
Labels:               role=spring-app
                      spring-app-id=podname-lp2nyowgmm
                      spring-deployment-id=podname-lp2nyowgmm
                      task-name=podname
Annotations:          iam.amazonaws.com/role: arn:aws:iam::XXXXXXXXXXXX:role/svc-XXXX-XXX-XX-XXXX-X-XXX-XXX-XXXXXXXXXXXXXXXXXXXX
                      kubernetes.io/psp: eks.privileged
Status:               Pending
IP:                   XX.XXX.XXX.XXX
IPs:
  IP:  XX.XXX.XXX.XXX
Containers:
  podname-lp2nyowgmm:
    Container ID:
    Image:          image_host:XXX/mysystem/myapp:notExist
    Image ID:
    Port:           <none>
    Host Port:      <none>
    Args:
      --spring.datasource.username=postgres
      --spring.cloud.task.name=podname
      --spring.datasource.url=jdbc:postgresql://...
      --spring.datasource.driverClassName=org.postgresql.Driver
      --spring.datasource.password=XXXX
      --fileId=XXXXXXXXXXX
      --spring.application.name=app-name
      --fileName=file_name.csv
      ...
      --spring.cloud.task.executionid=3
    State:          Waiting
      Reason:       ErrImagePull
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  8Gi
    Requests:
      cpu:     2
      memory:  8Gi
    Environment:
      ELASTIC_SEARCH_PORT:            80
      ELASTIC_SEARCH_PROTOCOL:        http
      SPRING_RABBITMQ_PORT:           ${RABBITMQ_SERVICE_PORT}
      ELASTIC_SEARCH_URL:             elasticsearch
      SPRING_PROFILES_ACTIVE:         kubernetes
      CLIENT_SECRET:                  ${CLIENT_SECRET}
      SPRING_RABBITMQ_HOST:           ${RABBITMQ_SERVICE_HOST}
      RELEASE_ENV_NAME:               QA_TEST
      SPRING_CLOUD_APPLICATION_GUID:  ${HOSTNAME}
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
Conditions:
  Type             Status
  Initialized      True
  Ready            False
  ContainersReady  False
  PodScheduled     True
Volumes:
  default-token-xxxxx:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-xxxxx
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  3m22s                 default-scheduler  Successfully assigned my-namespace/podname-lp2nyowgmm to some-ip.compute.internal
  Normal   Pulling    103s (x4 over 3m21s)  kubelet            Pulling image "image_host:XXX/mysystem/myapp:notExist"
  Warning  Failed     102s (x4 over 3m19s)  kubelet            Failed to pull image "image_host:XXX/mysystem/myapp:notExist": rpc error: code = Unknown desc = Error response from daemon: manifest for image_host:XXX/mysystem/myapp:notExist not found: manifest unknown: manifest unknown
  Warning  Failed     102s (x4 over 3m19s)  kubelet            Error: ErrImagePull
  Normal   BackOff    88s (x6 over 3m19s)   kubelet            Back-off pulling image "image_host:XXX/mysystem/myapp:notExist"
  Warning  Failed     73s (x7 over 3m19s)   kubelet            Error: ImagePullBackOff
Thanks for the question. Looking at the source code, we do not include Pending pods when counting the number of currently executing tasks, so something else may be going on. 1) Can you run kubectl describe pod while the pod is in that state and post the result (the status details)? 2) Is the deployer configured to create a Job for each task? (It defaults to false.)
1.7.3 is a very old release; we just shipped 2.7. The original logic counted from the task execution table rather than from pod status. If the version you are on is bound by that logic, it would explain what you are seeing. I strongly recommend upgrading.
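To check whether that accounting is what is filling up the limit, you could list the executions that never received an end time, since those are what the old concurrency check counts as "running". A rough sketch against the SCDF Java REST client follows; the executionList() call and resource getters are taken from recent client versions, so treat the exact signatures on 1.7.x as an assumption, and the server URL is a placeholder:

import java.net.URI;

import org.springframework.cloud.dataflow.rest.client.DataFlowTemplate;
import org.springframework.cloud.dataflow.rest.resource.TaskExecutionResource;

public class StuckExecutionReport {

    public static void main(String[] args) {
        // Hypothetical SCDF server URL; replace with your own.
        DataFlowTemplate dataFlow = new DataFlowTemplate(URI.create("http://scdf-server:9393"));

        // Only the first page of executions is inspected here, for brevity.
        for (TaskExecutionResource execution : dataFlow.taskOperations().executionList()) {
            if (execution.getEndTime() == null) {
                System.out.printf("Stuck execution %d for task %s (started %s)%n",
                        execution.getExecutionId(),
                        execution.getTaskName(),
                        execution.getStartTime());
            }
        }
    }
}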