Kubernetes jobs and back-off limit values: is the value a number of retries or minutes?

I am reading the Kubernetes documentation about jobs and retries. I found this:

There are situations where you want to fail a Job after some amount of retries due to a logical error in configuration etc. To do so, set .spec.backoffLimit to specify the number of retries before considering a Job as failed. The back-off limit is set by default to 6. Failed Pods associated with the Job are recreated by the Job controller with an exponential back-off delay (10s, 20s, 40s …) capped at six minutes. The back-off count is reset if no new failed Pods appear before the Job’s next status check.

I have two questions about the quote above:

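  1. Is the back-off limit value a number of minutes or a number of retries? The documentation's example using the value 6 (six) is confusing, because it first confirms that the value is a number of retries but then says "capped at six minutes".
  2. Is there a way to define the back-off delay? As far as I understand, this behaviour (10s, 20s, 40s, ...) is the default and cannot be changed.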

Don't be confused: .spec.backoffLimit is a number of retries, not a duration.

The Job controller recreates the failed Pods (associated with the Job) with an exponential delay (10s, 20s, 40s, ..., 360s). This delay is set by the Job controller itself.

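  • If the Pod fails, a new Pod is created after 10s
  • If it fails again, a new one is created after 20s
  • If it fails again, a new one appears after 40s
  • If it fails again, the next one appears after 80s (1m 20s)
  • If it fails again, the next one appears after 160s (2m 40s)
  • If it fails again, a new Pod comes after 320s (5m 20s)
  • If it fails again, you will see the next one after 360s (not 640s, because that is greater than the 360s, i.e. 6m, cap)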
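
To make the arithmetic in the list above concrete, here is a minimal Python sketch that reproduces the same sequence. It only models the doubling-with-a-cap rule described in the quoted documentation; the function name `backoff_delays` and its parameters `base` and `cap` are hypothetical and are not part of any Kubernetes API.

```python
def backoff_delays(failures, base=10, cap=360):
    """Return the delay (in seconds) applied after each consecutive failure.

    Assumption: the delay starts at `base` seconds, doubles after every
    failure, and never exceeds `cap` seconds (six minutes).
    """
    return [min(base * 2 ** i, cap) for i in range(failures)]


# Delays after the 1st through 7th consecutive failure.
print(backoff_delays(7))  # [10, 20, 40, 80, 160, 320, 360]
```

The seventh value would be 640s without the cap, which is why it is clamped to 360s. The "six" in "capped at six minutes" refers to this cap on the delay, while the default .spec.backoffLimit of 6 is a separate count of retries.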