How to configure a K8s cluster to utilize spare CPU capacity for ML training jobs (or other low-priority CPU-intensive work)

Question

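I'd like to use the idle CPU capacity in our Kubernetes cluster for low-priority work (in this case, ML training with TensorFlow) without starving the higher-priority services on the cluster of CPU when they suddenly spike, similar to OS process priority. Currently, we configure the autoscaler to add more nodes when CPU usage exceeds 60%, which means up to 40% of our CPU is always unused.

Questions: (1) Is this possible with K8s? After some experimentation, it seems pod priority is not quite the same thing, because my low-priority deployment does not immediately yield CPU to my high-priority deployment. (2) If it isn't possible, is there another commonly used strategy for making use of intentionally over-provisioned CPU capacity that still yields immediately to higher-priority services?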

Answer 1

According to https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/resource-qos.md#qos-classes:

In an overcommitted system (where sum of limits > machine capacity) containers might eventually have to be killed, for example if the system runs out of CPU or memory resources. Ideally, we should kill containers that are less important. For each resource, we divide containers into 3 QoS classes: Guaranteed, Burstable, and Best-Effort, in decreasing order of priority.

You can do something like this:

Set the high-priority service to Guaranteed:

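containers:
- name: high
  resources:
    # With only limits set, requests default to the limits,
    # which gives the pod the Guaranteed QoS class.
    limits:
      cpu: 8000m
      memory: 8Gi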

Set ml-job to Best-Effort:

containers:
- name: ml-job
  # No resource requests or limits, so the pod is Best-Effort.

I'm not sure whether your ml-job can tolerate being terminated. If it can't, this strategy may not be a good fit for you.
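
For reference, here is a minimal sketch of what the two snippets above might look like as complete Pod manifests. The image names are hypothetical placeholders, and the explicit requests block on the first pod is only there to make the Guaranteed condition visible (requests equal to limits for every resource in every container). As far as I know, Best-Effort containers are also given the lowest CPU share weight on the node, so under contention they should give way to pods that have CPU requests, but treat that as something to verify on your own cluster rather than a guarantee.

```yaml
# Sketch only: image names are hypothetical placeholders, not from the question.
apiVersion: v1
kind: Pod
metadata:
  name: high
spec:
  containers:
  - name: high
    image: example.com/high-priority-service:latest  # hypothetical image
    resources:
      # requests == limits for CPU and memory => QoS class Guaranteed
      requests:
        cpu: 8000m
        memory: 8Gi
      limits:
        cpu: 8000m
        memory: 8Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: ml-job
spec:
  containers:
  - name: ml-job
    image: example.com/tensorflow-training:latest  # hypothetical image
    # No resources block at all => QoS class Best-Effort
```

After the pods are created, you can check which class was assigned with kubectl get pod ml-job -o jsonpath='{.status.qosClass}'.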

kubernetes

cost-management

google-kubernetes-engine

devops

kubernetes-pod