SCDF: Error handling when pod failed to start

I am building a service that calls Spring Cloud Data Flow (SCDF) to spin up a new k8s pod for a Spring Batch job.

Map<String, String> properties = Map.of("testApp.cpu", cpu, "testApp.memory", memory);
LOGGER.info("Create task '{}' with definition '{}'", taskName, taskDefinition);
taskOperations.create(taskName, taskDefinition);

LOGGER.info("Launching task '{}' with properties {} and arguments '{}'", taskName, properties, args);
return taskOperations.launch(taskName, properties, args);

Everything works fine. The problem is that whenever we pull an image that does not exist (for example, because of some connectivity issue), the pod fails to start and we end up with a pending task (no batch job is created at all).

For example, we end up with a row in the task_execution table (an SCDF table) whose end time is null,

and there is no corresponding job in the batch_job_execution table.

At first this does not look too bad, because no pod is created and we are not consuming any resources. But once the number of these "pending" tasks reaches 20, we hit the well-known error:

Cannot launch task testApp. The maximum concurrent task executions is at its limit [20]
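
For illustration, the executions behind that error can be listed straight from the task_execution table. This is a rough sketch (not from the original post); it assumes the standard Spring Cloud Task TASK_EXECUTION column names and a plain JDBC DataSource, so adjust the names to your schema version:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.sql.DataSource;

// List task executions that never received an end time, i.e. the "pending" ones.
// Column names assume the standard Spring Cloud Task schema (TASK_EXECUTION table).
void logPendingExecutions(DataSource dataSource) throws Exception {
    String sql = "SELECT TASK_EXECUTION_ID, TASK_NAME, START_TIME "
               + "FROM TASK_EXECUTION WHERE END_TIME IS NULL";
    try (Connection conn = dataSource.getConnection();
         PreparedStatement ps = conn.prepareStatement(sql);
         ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            LOGGER.info("Pending execution {} for task '{}' started at {}",
                    rs.getLong("TASK_EXECUTION_ID"),
                    rs.getString("TASK_NAME"),
                    rs.getTimestamp("START_TIME"));
        }
    }
}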

I am trying to find a way to detect that the pod spin-up failed (so that we can mark the task as errored), but so far without success.

Is there a way to detect that a task launch has failed when the task launches a new k8s pod?
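
One way to detect this outside of SCDF itself would be to ask the Kubernetes API whether the launched pod is stuck on an image pull, using the task-name label that the Kubernetes deployer puts on task pods (visible in the describe output below). This is a rough, untested sketch assuming the fabric8 kubernetes-client (6.x); it is not an SCDF API:

import java.util.List;

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

// Returns true if any pod launched for this task is stuck waiting on an image pull.
// "task-name" is the label the Kubernetes deployer adds to launched task pods.
boolean taskPodFailedToStart(String namespace, String taskName) {
    try (KubernetesClient client = new KubernetesClientBuilder().build()) {
        List<Pod> pods = client.pods()
                .inNamespace(namespace)
                .withLabel("task-name", taskName)
                .list()
                .getItems();
        return pods.stream()
                .flatMap(pod -> pod.getStatus().getContainerStatuses().stream())
                .anyMatch(cs -> cs.getState().getWaiting() != null
                        && ("ErrImagePull".equals(cs.getState().getWaiting().getReason())
                                || "ImagePullBackOff".equals(cs.getState().getWaiting().getReason())));
    }
}

If such a pod is found, the service could treat the launch as failed instead of leaving the execution pending and letting it count against the concurrent-execution limit.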

Update

Not sure if it is relevant, but we are using SCDF 1.7.3.RELEASE.

kubectl describe output of the failed pod:

Name:                 podname-lp2nyowgmm
Namespace:            my-namespace
Priority:             1000
Priority Class Name:  test-cluster-default
Node:                 some-ip.compute.internal/XX.XXX.XXX.XX
Start Time:           Thu, 14 Jan 2021 18:47:52 +0700
Labels:               role=spring-app
                      spring-app-id=podname-lp2nyowgmm
                      spring-deployment-id=podname-lp2nyowgmm
                      task-name=podname
Annotations:          iam.amazonaws.com/role: arn:aws:iam::XXXXXXXXXXXX:role/svc-XXXX-XXX-XX-XXXX-X-XXX-XXX-XXXXXXXXXXXXXXXXXXXX
                      kubernetes.io/psp: eks.privileged
Status:               Pending
IP:                   XX.XXX.XXX.XXX
IPs:
  IP:  XX.XXX.XXX.XXX
Containers:
  podname-lp2nyowgmm:
    Container ID:
    Image:         image_host:XXX/mysystem/myapp:notExist
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      --spring.datasource.username=postgres
      --spring.cloud.task.name=podname
      --spring.datasource.url=jdbc:postgresql://...
      --spring.datasource.driverClassName=org.postgresql.Driver
      --spring.datasource.password=XXXX
      --fileId=XXXXXXXXXXX
      --spring.application.name=app-name
      --fileName=file_name.csv
      ...
      --spring.cloud.task.executionid=3
    State:          Waiting
      Reason:       ErrImagePull
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  8Gi
    Requests:
      cpu:     2
      memory:  8Gi
    Environment:
      ELASTIC_SEARCH_PORT:               80
      ELASTIC_SEARCH_PROTOCOL:           http
      SPRING_RABBITMQ_PORT:              ${RABBITMQ_SERVICE_PORT}
      ELASTIC_SEARCH_URL:                elasticsearch
      SPRING_PROFILES_ACTIVE:            kubernetes
      CLIENT_SECRET:                     ${CLIENT_SECRET}
      SPRING_RABBITMQ_HOST:              ${RABBITMQ_SERVICE_HOST}
      RELEASE_ENV_NAME:                  QA_TEST
      SPRING_CLOUD_APPLICATION_GUID:     ${HOSTNAME}
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-xxxxx:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-xxxxx
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  3m22s                 default-scheduler  Successfully assigned my-namespace/podname-lp2nyowgmm to some-ip.compute.internal
  Normal   Pulling    103s (x4 over 3m21s)  kubelet            Pulling image "image_host:XXX/mysystem/myapp:notExist"
  Warning  Failed     102s (x4 over 3m19s)  kubelet            Failed to pull image "image_host:XXX/mysystem/myapp:notExist": rpc error: code = Unknown desc = Error response from daemon: manifest for image_host:XXX/mysystem/myapp:notExist not found: manifest unknown: manifest unknown
  Warning  Failed     102s (x4 over 3m19s)  kubelet            Error: ErrImagePull
  Normal   BackOff    88s (x6 over 3m19s)   kubelet            Back-off pulling image "image_host:XXX/mysystem/myapp:notExist"
  Warning  Failed     73s (x7 over 3m19s)   kubelet            Error: ImagePullBackOff

Thanks for the question. Looking at the source code, we do not include Pending pods when counting the number of currently running task executions, so something else may be going on here. 1) Could you run kubectl describe pod while the pod is in this state and post the result (the status details)? 2) Is the deployer configured to create a job for each task? (The default is false.)

1.7.3 is a very old version. We just released 2.7. The original logic used the task execution table rather than the pod status. If the version you are on is bound by that, it would explain what you are seeing. I strongly recommend upgrading.