为什么实际 CPU 利用率超过 Kubernetes 中的 Pod CPU 限制
Why the actual CPU utilization percentage exceeds Pod CPU limit in Kubernetes
我 运行 在我的集群(10 个节点)中安装了几个 kubernetes pods。每个 pod 仅包含一个容器,该容器承载一个工作进程。我已经为容器指定了 CPU "limits" 和 "requests" 。以下是对节点 (crypt12) 上 运行ning 的一个 pod 的描述。
Name: alexnet-worker-6-9954df99c-p7tx5
Namespace: default
Node: crypt12/172.16.28.136
Start Time: Sun, 15 Jul 2018 22:26:57 -0400
Labels: job=worker
name=alexnet
pod-template-hash=551089557
task=6
Annotations: <none>
Status: Running
IP: 10.38.0.1
Controlled By: ReplicaSet/alexnet-worker-6-9954df99c
Containers:
alexnet-v1-container:
Container ID: docker://214e30e87ed4a7240e13e764200a260a883ea4550a1b5d09d29ed827e7b57074
Image: alexnet-tf150-py3:v1
Image ID: docker://sha256:4f18b4c45a07d639643d7aa61b06bfee1235637a50df30661466688ab2fd4e6d
Port: 5000/TCP
Host Port: 0/TCP
Command:
/usr/bin/python3
cifar10_distributed.py
Args:
--data_dir=xxxx
State: Running
Started: Sun, 15 Jul 2018 22:26:59 -0400
Ready: True
Restart Count: 0
Limits:
cpu: 800m
memory: 6G
Requests:
cpu: 800m
memory: 6G
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-hfnlp (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
default-token-hfnlp:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-hfnlp
Optional: false
QoS Class: Guaranteed
Node-Selectors: kubernetes.io/hostname=crypt12
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
下面是我运行"kubectl describle node crypt12"
时的输出
Name: crypt12
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/hostname=crypt12
Annotations: kubeadm.alpha.kubernetes.io/cri-socket=/var/run/dockershim.sock
node.alpha.kubernetes.io/ttl=0
volumes.kubernetes.io/controller-managed-attach-detach=true
CreationTimestamp: Wed, 11 Jul 2018 23:07:41 -0400
Taints: <none>
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Mon, 16 Jul 2018 16:25:43 -0400 Wed, 11 Jul 2018 22:57:22 -0400 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Mon, 16 Jul 2018 16:25:43 -0400 Wed, 11 Jul 2018 22:57:22 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Mon, 16 Jul 2018 16:25:43 -0400 Wed, 11 Jul 2018 22:57:22 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, 16 Jul 2018 16:25:43 -0400 Wed, 11 Jul 2018 22:57:22 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 16 Jul 2018 16:25:43 -0400 Wed, 11 Jul 2018 22:57:42 -0400 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 172.16.28.136
Hostname: crypt12
Capacity:
cpu: 8
ephemeral-storage: 144937600Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8161308Ki
pods: 110
Allocatable:
cpu: 8
ephemeral-storage: 133574491939
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8058908Ki
pods: 110
System Info:
Machine ID: f0444e00ba2ed20e5314e6bc5b0f0f60
System UUID: 37353035-3836-5355-4530-32394E44414D
Boot ID: cf2a9daf-c959-4c7e-be61-5e44a44670c4
Kernel Version: 4.4.0-87-generic
OS Image: Ubuntu 16.04.3 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://1.13.1
Kubelet Version: v1.11.0
Kube-Proxy Version: v1.11.0
Non-terminated Pods: (3 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
default alexnet-worker-6-9954df99c-p7tx5 800m (10%) 800m (10%) 6G (72%) 6G (72%)
kube-system kube-proxy-7kdkd 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system weave-net-dpclj 20m (0%) 0 (0%) 0 (0%) 0 (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 820m (10%) 800m (10%)
memory 6G (72%) 6G (72%)
Events: <none>
如图所示,在节点描述("Non-terminated Pods" 部分)中,CPU 限制为 10%。但是,当我在节点(crypt12)上执行运行 "ps"或"top"命令时,工作进程的CPU利用率超过了10%(大约20%)。为什么会这样?谁能阐明这一点?
更新:
我找到了一个 github 问题讨论,在那里我找到了问题的答案:"kubectl describe node" 的 cpu 百分比是 "CPU-limits/# of Cores"。因为我把CPU-limit设置为0.8,所以10%就是0.8/8的结果。
首先,默认情况下,Top 显示每个内核的利用率百分比。所以使用 8 个内核,您可以拥有 800% 的利用率。
如果您正在正确阅读最高统计数据,那么这可能与您的节点 运行 比您的 pod 多的进程有关。想想 kube-proxy、kubelet 和任何其他控制器。 GKE 还运行一个仪表板并调用 api 进行统计。
另请注意,资源是每 100 毫秒计算一次。一个容器的利用率可能会飙升至 10% 以上,但在此期间内平均使用率绝不会超过允许值。
在 official documentation 中显示:
The spec.containers[].resources.limits.cpu is converted to its millicore value and multiplied by 100. The resulting value is the total amount of CPU time that a container can use every 100ms. A container cannot use more than its share of CPU time during this interval.
我找到了一个 github 问题讨论,在那里我找到了问题的答案:"kubectl describe node" 的 cpu 百分比是 "CPU-limits/# of Cores"。因为我把CPU-limit设置为0.8,所以10%就是0.8/8的结果。
这是 link:https://github.com/kubernetes/kubernetes/issues/24925
我 运行 在我的集群(10 个节点)中安装了几个 kubernetes pods。每个 pod 仅包含一个容器,该容器承载一个工作进程。我已经为容器指定了 CPU "limits" 和 "requests" 。以下是对节点 (crypt12) 上 运行ning 的一个 pod 的描述。
Name: alexnet-worker-6-9954df99c-p7tx5
Namespace: default
Node: crypt12/172.16.28.136
Start Time: Sun, 15 Jul 2018 22:26:57 -0400
Labels: job=worker
name=alexnet
pod-template-hash=551089557
task=6
Annotations: <none>
Status: Running
IP: 10.38.0.1
Controlled By: ReplicaSet/alexnet-worker-6-9954df99c
Containers:
alexnet-v1-container:
Container ID: docker://214e30e87ed4a7240e13e764200a260a883ea4550a1b5d09d29ed827e7b57074
Image: alexnet-tf150-py3:v1
Image ID: docker://sha256:4f18b4c45a07d639643d7aa61b06bfee1235637a50df30661466688ab2fd4e6d
Port: 5000/TCP
Host Port: 0/TCP
Command:
/usr/bin/python3
cifar10_distributed.py
Args:
--data_dir=xxxx
State: Running
Started: Sun, 15 Jul 2018 22:26:59 -0400
Ready: True
Restart Count: 0
Limits:
cpu: 800m
memory: 6G
Requests:
cpu: 800m
memory: 6G
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-hfnlp (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
default-token-hfnlp:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-hfnlp
Optional: false
QoS Class: Guaranteed
Node-Selectors: kubernetes.io/hostname=crypt12
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
下面是我运行"kubectl describle node crypt12"
时的输出Name: crypt12
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/hostname=crypt12
Annotations: kubeadm.alpha.kubernetes.io/cri-socket=/var/run/dockershim.sock
node.alpha.kubernetes.io/ttl=0
volumes.kubernetes.io/controller-managed-attach-detach=true
CreationTimestamp: Wed, 11 Jul 2018 23:07:41 -0400
Taints: <none>
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Mon, 16 Jul 2018 16:25:43 -0400 Wed, 11 Jul 2018 22:57:22 -0400 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Mon, 16 Jul 2018 16:25:43 -0400 Wed, 11 Jul 2018 22:57:22 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Mon, 16 Jul 2018 16:25:43 -0400 Wed, 11 Jul 2018 22:57:22 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, 16 Jul 2018 16:25:43 -0400 Wed, 11 Jul 2018 22:57:22 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 16 Jul 2018 16:25:43 -0400 Wed, 11 Jul 2018 22:57:42 -0400 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 172.16.28.136
Hostname: crypt12
Capacity:
cpu: 8
ephemeral-storage: 144937600Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8161308Ki
pods: 110
Allocatable:
cpu: 8
ephemeral-storage: 133574491939
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8058908Ki
pods: 110
System Info:
Machine ID: f0444e00ba2ed20e5314e6bc5b0f0f60
System UUID: 37353035-3836-5355-4530-32394E44414D
Boot ID: cf2a9daf-c959-4c7e-be61-5e44a44670c4
Kernel Version: 4.4.0-87-generic
OS Image: Ubuntu 16.04.3 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://1.13.1
Kubelet Version: v1.11.0
Kube-Proxy Version: v1.11.0
Non-terminated Pods: (3 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
default alexnet-worker-6-9954df99c-p7tx5 800m (10%) 800m (10%) 6G (72%) 6G (72%)
kube-system kube-proxy-7kdkd 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system weave-net-dpclj 20m (0%) 0 (0%) 0 (0%) 0 (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 820m (10%) 800m (10%)
memory 6G (72%) 6G (72%)
Events: <none>
如图所示,在节点描述("Non-terminated Pods" 部分)中,CPU 限制为 10%。但是,当我在节点(crypt12)上执行运行 "ps"或"top"命令时,工作进程的CPU利用率超过了10%(大约20%)。为什么会这样?谁能阐明这一点?
更新: 我找到了一个 github 问题讨论,在那里我找到了问题的答案:"kubectl describe node" 的 cpu 百分比是 "CPU-limits/# of Cores"。因为我把CPU-limit设置为0.8,所以10%就是0.8/8的结果。
首先,默认情况下,Top 显示每个内核的利用率百分比。所以使用 8 个内核,您可以拥有 800% 的利用率。
如果您正在正确阅读最高统计数据,那么这可能与您的节点 运行 比您的 pod 多的进程有关。想想 kube-proxy、kubelet 和任何其他控制器。 GKE 还运行一个仪表板并调用 api 进行统计。
另请注意,资源是每 100 毫秒计算一次。一个容器的利用率可能会飙升至 10% 以上,但在此期间内平均使用率绝不会超过允许值。
在 official documentation 中显示:
The spec.containers[].resources.limits.cpu is converted to its millicore value and multiplied by 100. The resulting value is the total amount of CPU time that a container can use every 100ms. A container cannot use more than its share of CPU time during this interval.
我找到了一个 github 问题讨论,在那里我找到了问题的答案:"kubectl describe node" 的 cpu 百分比是 "CPU-limits/# of Cores"。因为我把CPU-limit设置为0.8,所以10%就是0.8/8的结果。
这是 link:https://github.com/kubernetes/kubernetes/issues/24925