What is the exact use of requests in Kubernetes?

I'm confused about the relationship between two parameters: requests and the cgroup cpu.shares value, which gets updated after a Pod is deployed. From the reading I've done so far, cpu.shares reflects a kind of priority when competing for a chance to use the CPU, and it is a relative value.

So my question is: why does Kubernetes treat the CPU request value as an absolute value when scheduling? When it comes to CPU, processes get a time slice to execute based on their priority (according to the CFS mechanism). To my knowledge, there is no such thing as handing out a fixed amount of CPUs (1 CPU, 2 CPUs, etc.). So, if the cpu.shares value is what prioritizes tasks, why does Kubernetes consider the exact request value (e.g. 1500m, 200m) to find a node?

Please correct me if I'm wrong. Thanks!!

Answering your questions based on the main question and the comments:

So my question is: why does Kubernetes consider the request value of the CPU as an absolute value when scheduling?

To my knowledge, there's no such thing as giving out such amounts of CPUs (1 CPU, 2 CPUs, etc.). So, if the cpu.shares value is what prioritizes the tasks, why does Kubernetes consider the exact request value (e.g. 1500m, 200m) to find a node?

That's because the decimal CPU values from the requests are always converted to values in millicores: 0.1 is equal to 100m, which can be read as "one hundred millicpu" or "one hundred millicores". These units are specific to Kubernetes:

Fractional requests are allowed. A Container with spec.containers[].resources.requests.cpu of 0.5 is guaranteed half as much CPU as one that asks for 1 CPU. The expression 0.1 is equivalent to the expression 100m, which can be read as "one hundred millicpu". Some people say "one hundred millicores", and this is understood to mean the same thing. A request with a decimal point, like 0.1, is converted to 100m by the API, and precision finer than 1m is not allowed. For this reason, the form 100m might be preferred.

CPU is always requested as an absolute quantity, never as a relative quantity; 0.1 is the same amount of CPU on a single-core, dual-core, or 48-core machine.

Based on the above, keep in mind that you can tell a container to use 1.5 CPUs of a node by specifying either cpu: 1.5 or cpu: 1500m.
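This absolute quantity is also what the scheduler works with: when it looks for a node, it adds the new pod's CPU request to the requests of the pods already placed on that node and checks that the total still fits within the node's allocatable CPU. You can see both figures for a node with kubectl (the exact output layout may differ between versions):

kubectl describe node <node-name>
# "Allocatable:"         -> the total CPU the scheduler may hand out on this node
# "Allocated resources:" -> the sum of CPU requests of the pods already scheduled here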

I just want to know whether lowering the cpu.shares value in cgroups (which is modified by k8s after the deployment) affects the CPU power consumed by the process. For instance, assume containers A and B have 1024 and 2048 shares allocated, so the available resources will be split in a 1:2 ratio. Would it be the same if we configured cpu.shares as 10 and 20 for the two containers? The ratio is still 1:2.

To make it clear: yes, the ratio is indeed the same, but the values are different. 1024 and 2048 in cpu.shares correspond to cpu: 1000m and cpu: 2000m defined in the Kubernetes resources, while 10 and 20 correspond to cpu: 10m and cpu: 20m.
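For reference, the conversion is essentially shares = millicores * 1024 / 1000 (1 full CPU = 1024 shares). A minimal shell sketch of that arithmetic, assuming integer division that rounds down (the helper function name is just for illustration):

# Convert a CPU request in millicores to cgroup v1 cpu.shares (1000m == 1024 shares)
millicores_to_shares() {
  echo $(( $1 * 1024 / 1000 ))
}

millicores_to_shares 1000   # -> 1024
millicores_to_shares 2000   # -> 2048
millicores_to_shares 10     # -> 10
millicores_to_shares 20     # -> 20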

Let's say the cluster nodes are based on a Linux OS. How does Kubernetes ensure that the request value is given to a container? Ultimately, the OS will use the configuration available in a cgroup to allocate the resource, right? It modifies the cpu.shares value of the cgroup. So my question is: which files are modified by k8s to tell the operating system to give 100m or 200m to a container?

Yes, your thinking is correct. Let me explain it in more detail.

Generally, on a Kubernetes node there are three cgroups under the root cgroup, named slices:

K8s uses the cpu.shares file to allocate the CPU resources. In this case (a node with 4 cores), the root cgroup inherits 4096 CPU shares, which are 100% of the available CPU power (1 core = 1024; this is a fixed value). The root cgroup allocates its shares proportionally based on its children's cpu.shares, and they do the same with their children, and so on. On a typical Kubernetes node there are three cgroups under the root cgroup, namely system.slice, user.slice, and kubepods. The first two are used to allocate resources for critical system workloads and for non-k8s user-space programs. The last one, kubepods, is created by k8s to allocate resources to pods.
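A quick way to see this hierarchy on a node (a sketch assuming cgroup v1 with the cpu controller mounted at /sys/fs/cgroup/cpu; depending on the cgroup driver the last directory may be called kubepods or kubepods.slice, and the values depend on the node):

# List the top-level slices under the cpu controller
ls -d /sys/fs/cgroup/cpu/system.slice /sys/fs/cgroup/cpu/user.slice /sys/fs/cgroup/cpu/kubepods*

# Compare their relative weights (a higher cpu.shares value means a larger share of CPU under contention)
cat /sys/fs/cgroup/cpu/system.slice/cpu.shares
cat /sys/fs/cgroup/cpu/user.slice/cpu.shares
cat /sys/fs/cgroup/cpu/kubepods*/cpu.shares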

To check which files are modified, we need to go to the /sys/fs/cgroup/cpu directory. Here we can find a directory called kubepods (which is one of the slices mentioned above), where all the cpu.shares files for the pods are located. Inside the kubepods directory there are two more folders, besteffort and burstable. It is worth mentioning here that Kubernetes has three QoS classes: Guaranteed, Burstable, and BestEffort.

Every pod has a QoS class assigned to it, and depending on which class it is, the pod is placed in the corresponding directory (except for Guaranteed: pods with this class are created directly in the kubepods directory).

For example, I'm creating a pod with the following definition:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deployment
spec:
  selector:
    matchLabels:
      app: test-deployment
  replicas: 2 # tells deployment to run 2 pods matching the template
  template:
    metadata:
      labels:
        app: test-deployment
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 300m
      - name: busybox
        image: busybox
        args:
        - sleep
        - "999999"
        resources:
          requests:
            cpu: 150m

Based on the definition above, this pod will be assigned the QoS class Burstable, so its cgroup will be created in the /sys/fs/cgroup/cpu/kubepods/burstable directory.
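To find the podXXXX directory that belongs to a specific pod, note that the directory name contains the pod's UID; you can also confirm the QoS class directly. A sketch with a placeholder pod name:

# Replace test-deployment-xxxxxxxxxx-yyyyy with an actual pod name from "kubectl get pods"
kubectl get pod test-deployment-xxxxxxxxxx-yyyyy -o jsonpath='{.status.qosClass}'   # e.g. Burstable
kubectl get pod test-deployment-xxxxxxxxxx-yyyyy -o jsonpath='{.metadata.uid}'      # UID used in the pod's cgroup directory name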

Now we can check the cpu.shares set for this pod:

user@cluster /sys/fs/cgroup/cpu/kubepods/burstable/podf13d6898-69f9-44eb-8ea6-5284e1778f90 $ cat cpu.shares
460

This is correct, because one container requests 300m and the second one 150m, and the value is calculated by multiplying the total request by 1024 (300m + 150m = 450m, and 450 * 1024 / 1000 ≈ 460). For each container we also have subdirectories:

user@cluster /sys/fs/cgroup/cpu/kubepods/burstable/podf13d6898-69f9-44eb-8ea6-5284e1778f90/fa6194cbda0ccd0b1dc77793bfbff608064aa576a5a83a2f1c5c741de8cf019a $ cat cpu.shares
153
user@cluster /sys/fs/cgroup/cpu/kubepods/burstable/podf13d6898-69f9-44eb-8ea6-5284e1778f90/d5ba592186874637d703544ceb6f270939733f6292e1fea7435dd55b6f3f1829 $ cat cpu.shares
307
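A quick sanity check of those numbers with shell arithmetic (integer division rounds down, which matches the values written to the cpu.shares files):

echo $(( 150 * 1024 / 1000 ))   # 153 -> busybox container (150m request)
echo $(( 300 * 1024 / 1000 ))   # 307 -> nginx container (300m request)
echo $(( 153 + 307 ))           # 460 -> the pod-level cpu.shares shown above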

If you want to learn more about Kubernetes CPU management, I recommend reading the following: