How to fix OpenShift pod start failing with NodeUnderDiskPressure?
I start OpenShift locally using:
oc cluster up
Then I create a pod from hello-pod.json with this command:
oc create -f examples/hello-openshift/hello-pod.json
The pod is created but fails to start. OpenShift shows the error:
Reason: Failed Scheduling
Message: 0/1 nodes are available: 1 NodeUnderDiskPressure.
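I still have plenty of free space on my hard disk, and I don't know where to look for further logs. How can I fix this?
Basically I just needed to reset Docker's filesystem and the Kubernetes config in my home directory. Be aware that this deletes every local Docker image, container, and volume.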
$ oc cluster down
$ sudo systemctl stop docker
$ sudo rm -rf /var/lib/docker
$ rm -rf ~/.kube
$ sudo systemctl start docker
$ oc cluster up
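Done! After that I was able to create pods.
Here are some other things I tried while diagnosing the same NodeUnderDiskPressure error; they may help you if the steps above do not fix the problem:
First, I retrieved the available nodes with kubectl: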
$ oc login -u system:admin
$ kubectl get nodes
NAME STATUS AGE VERSION
localhost Ready 12h v1.7.6+a08f5eeb62
Next I retrieved the description of the localhost node:
$ kubectl describe node localhost
Name: localhost
Role:
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/hostname=localhost
Annotations: volumes.kubernetes.io/controller-managed-attach-detach=true
Taints: <none>
CreationTimestamp: Mon, 05 Mar 2018 20:00:20 -0600
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Tue, 06 Mar 2018 08:09:03 -0600 Mon, 05 Mar 2018 20:00:20 -0600 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Tue, 06 Mar 2018 08:09:03 -0600 Mon, 05 Mar 2018 20:00:20 -0600 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure True Tue, 06 Mar 2018 08:09:03 -0600 Mon, 05 Mar 2018 20:00:31 -0600 KubeletHasDiskPressure kubelet has disk pressure
Ready True Tue, 06 Mar 2018 08:09:03 -0600 Mon, 05 Mar 2018 20:00:31 -0600 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 192.168.0.14
Hostname: localhost
Capacity:
cpu: 4
memory: 16311024Ki
pods: 40
Allocatable:
cpu: 4
memory: 16208624Ki
pods: 40
System Info:
Machine ID: 6895f77789824d26acef6d0db236319f
System UUID: 248A664C-33F8-11B2-A85C-FC31558EDC86
Boot ID: 1a5cc22b-81f1-4b07-b26f-917a7d17936f
Kernel Version: 4.13.16-100.fc25.x86_64
OS Image: CentOS Linux 7 (Core)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://1.12.6
Kubelet Version: v1.7.6+a08f5eeb62
Kube-Proxy Version: v1.7.6+a08f5eeb62
ExternalID: localhost
Non-terminated Pods: (0 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
0 (0%) 0 (0%) 0 (0%) 0 (0%)
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
12h 8m 2877 kubelet, localhost Warning EvictionThresholdMet Attempting to reclaim imagefs
11h 3m 136 kubelet, localhost Warning ImageGCFailed (combined from similar events): wanted to free 3113113190 bytes, but freed 0 bytes space with errors in image deletion: [rpc error: code = 2 desc = Error response from daemon: {"message":"conflict: unable to delete 933861786d39 (must be forced) - image is being used by stopped container 82eca7ad6fd6"}, rpc error: code = 2 desc = Error response from daemon: {"message":"conflict: unable to delete bcccfe5352d3 (must be forced) - image is being used by stopped container 9c4ad3dc4b80"}, rpc error: code = 2 desc = Error response from daemon: {"message":"conflict: unable to delete b7b0dbc4f785 (must be forced) - image is being used by stopped container d388fa17ff84"}, rpc error: code = 2 desc = Error response from daemon: {"message":"conflict: unable to delete 0129e5e73319 (cannot be forced) - image has dependent child images"}, rpc error: code = 2 desc = Error response from daemon: {"message":"conflict: unable to delete 725dcfab7d63 (must be forced) - image is being used by stopped container 9eb3a771aa6f"}, rpc error: code = 2 desc = Error response from daemon: {"message":"conflict: unable to delete 8ec432b4cda3 (cannot be forced) - image is being used by running container a3fe6da22775"}]
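A few things to note:
The DiskPressure condition has status True.
The warnings under Events: first I can see EvictionThresholdMet (Attempting to reclaim imagefs); I can also see ImageGCFailed, which contains details about the images that could not be deleted.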
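As a rough sketch of the first check, the conditions that are currently True can be pulled out of the describe output with a small awk filter. The function name pressure_conditions is made up for this example, and it assumes the Conditions section is laid out like the output above (Type in the first column, Status in the second).

```shell
# Hypothetical helper: reads `kubectl describe node <name>` output on stdin
# and prints the Type of every condition whose Status is True, except Ready.
pressure_conditions() {
  awk '
    # Start of the Conditions section.
    $1 == "Conditions:" { in_cond = 1; next }
    # The next section header (e.g. "Addresses:") ends the section.
    in_cond && $1 ~ /:$/ { in_cond = 0 }
    # Inside the section: first field is Type, second is Status.
    in_cond && $2 == "True" && $1 != "Ready" { print $1 }
  '
}

# Usage (not run here):
#   kubectl describe node localhost | pressure_conditions
```

For the node shown above this would print DiskPressure, since it is the only non-Ready condition with status True.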
In my case, the ImageGCFailed message, reformatted for readability:
(combined from similar events):wanted to free 3113113190 bytes,
but freed 0 bytes space with errors in image deletion:[
rpc error: code = 2 desc = Error response from daemon:{
"message":"conflict: unable to delete 933861786d39 (must be forced) - image is being used by stopped container 82eca7ad6fd6"
},
rpc error: code = 2 desc = Error response from daemon:{
"message":"conflict: unable to delete bcccfe5352d3 (must be forced) - image is being used by stopped container 9c4ad3dc4b80"
},
rpc error: code = 2 desc = Error response from daemon:{
"message":"conflict: unable to delete b7b0dbc4f785 (must be forced) - image is being used by stopped container d388fa17ff84"
},
rpc error: code = 2 desc = Error response from daemon:{
"message":"conflict: unable to delete 0129e5e73319 (cannot be forced) - image has dependent child images"
},
rpc error: code = 2 desc = Error response from daemon:{
"message":"conflict: unable to delete 725dcfab7d63 (must be forced) - image is being used by stopped container 9eb3a771aa6f"
},
rpc error: code = 2 desc = Error response from daemon:{
"message":"conflict: unable to delete 8ec432b4cda3 (cannot be forced) - image is being used by running container a3fe6da22775"
}
]
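The message above can be reduced to the part that matters: which image is blocked by which container, and whether that container is stopped or running. The following is only a sketch; gc_blockers is a hypothetical name, and the pattern assumes the Docker error wording shown above. Entries without a container (such as "image has dependent child images") are skipped.

```shell
# Hypothetical helper: reads an ImageGCFailed message on stdin and prints
# one line per conflict in the form: <image id> <stopped|running> <container id>
gc_blockers() {
  grep -o 'unable to delete [0-9a-f]* ([a-z ]*) - image is being used by [a-z]* container [0-9a-f]*' |
    # Field 4 is the image ID, the last field is the container ID,
    # and the field two before the last is the container state.
    awk '{ print $4, $(NF - 2), $NF }'
}

# Usage (not run here):
#   kubectl describe node localhost | gc_blockers
```

Lines marked stopped point at containers that can be removed to let image garbage collection proceed; lines marked running belong to containers that are still needed.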
Based on this information: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#reclaiming-node-level-resources
I then looked at the existing containers and tried to remove them manually:
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a3fe6da22775 openshift/origin:v3.7.1 "/usr/bin/openshift s" 12 hours ago Up 12 hours origin
82eca7ad6fd6 dtf-bpms/nodejs-mongo-persistent-2:4e90f728 "/bin/sh -ic 'npm sta" 3 months ago Exited (137) 3 months ago openshift_s2i-build_nodejs-mongo-persistent-2_dtf-bpms_post-commit_fe89fcfd
9c4ad3dc4b80 dtf-bpms/nodejs-mongo-persistent-2:4e23c7d5 "/bin/sh -ic 'npm tes" 3 months ago Exited (137) 3 months ago openshift_s2i-build_nodejs-mongo-persistent-2_dtf-bpms_post-commit_de141bcd
d388fa17ff84 dtf-bpms/nodejs-mongo-persistent-1:439d35ea "/bin/sh -ic 'npm tes" 3 months ago Exited (137) 3 months ago openshift_s2i-build_nodejs-mongo-persistent-1_dtf-bpms_post-commit_277b19ca
9eb3a771aa6f hello-world "/hello" 3 months ago Exited (0) 3 months ago serene_babbage
Now I remove all stopped containers manually:
$ docker rm $(docker ps -a -q)
82eca7ad6fd6
9c4ad3dc4b80
d388fa17ff84
9eb3a771aa6f
Error response from daemon: You cannot remove a running container a3fe6da22775a559fe94ab0eb5f52d55d9aca6d1f950f107d13243fa029e071f. Stop the container before attempting removal or use -f
In this case it is fine to keep the openshift container.
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a3fe6da22775 openshift/origin:v3.7.1 "/usr/bin/openshift s" 12 hours ago Up 12 hours origin
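The docker rm call above also tries to remove the running origin container and only fails because Docker refuses. A slightly more targeted variant, sketched here under the assumption that your Docker version supports the status filter of docker ps, only touches exited containers. The function name remove_stopped_containers is invented for this example.

```shell
# Hypothetical helper: remove only containers in the "exited" state,
# leaving running containers such as origin untouched.
remove_stopped_containers() {
  stopped=$(docker ps -a -q -f status=exited)
  if [ -z "$stopped" ]; then
    echo "no exited containers found"
    return 0
  fi
  # Word splitting is intentional here: one argument per container ID.
  docker rm $stopped
}

# Usage (not run here):
#   remove_stopped_containers
```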
Next I restarted OpenShift and Docker, tried creating my containers again, and described the localhost node:
$ oc cluster down
$ sudo systemctl restart docker
$ oc cluster up
... (wait for cluster up start)
$ [CREATE PROJECT AND CONTAINERS]
$ oc login -u system:admin
$ kubectl describe node localhost
... (node description and header information)
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
1h 1h 2 kubelet, localhost Normal NodeHasSufficientMemory Node localhost status is now: NodeHasSufficientMemory
1h 1h 2 kubelet, localhost Normal NodeHasNoDiskPressure Node localhost status is now: NodeHasNoDiskPressure
1h 1h 1 kubelet, localhost Normal NodeAllocatableEnforced Updated Node Allocatable limit across pods
1h 1h 2 kubelet, localhost Normal NodeHasSufficientDisk Node localhost status is now: NodeHasSufficientDisk
1h 1h 1 kubelet, localhost Normal NodeReady Node localhost status is now: NodeReady
1h 1h 1 kubelet, localhost Normal NodeHasDiskPressure Node localhost status is now: NodeHasDiskPressure
1h 1h 1 kubelet, localhost Warning ImageGCFailed wanted to free 2934625894 bytes, but freed 0 bytes space with errors in image deletion: rpc error: code = 2 desc = Error response from daemon: {"message":"conflict: unable to delete 8ec432b4cda3 (cannot be forced) - image is being used by running container 4bcd2196747c"}
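You can see that NodeHasDiskPressure still shows up after cleaning up the old unused containers, even though the images they held were released according to the Docker events. This is the point where the next step was to delete the old, dirty Docker filesystem and start from a fresh one.
In my case, adjusting node-config.yaml solved the problem:
1) Search for the generated node-config.yaml file, e.g. under /var/lib/origin/ or under your custom config path.
2) Open it in an editor, search for kubeletArguments, and add the disk eviction policy you want: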
kubeletArguments:
  eviction-hard:
  - memory.available<100Mi
  - nodefs.available<1%
  - nodefs.inodesFree<1%
  - imagefs.available<1%
A detailed description can be found here: OpenShift Documentation - Default Hard Eviction Thresholds
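These thresholds are percentages of the filesystem, not absolute sizes, which is probably why a disk with "plenty of free space" can still report disk pressure. As far as I know the upstream kubelet defaults are nodefs.available<10% and imagefs.available<15%, but treat those numbers as an assumption and check the linked documentation for your version. The sketch below shows the comparison such a threshold expresses; both function names are hypothetical, and using /var/lib/docker as the image filesystem is an assumption based on the Docker directory mentioned earlier.

```shell
# Hypothetical helper: print "<available> <total>" in 1K blocks for the
# filesystem that holds PATH, using POSIX df output.
fs_available_and_total() {
  df -P "$1" | awk 'NR == 2 { print $4, $2 }'
}

# Hypothetical helper: succeed (exit status 0) if AVAILABLE is less than
# PERCENT percent of TOTAL, i.e. if a "<PERCENT%" threshold would be met.
is_below_threshold() {
  awk -v avail="$1" -v total="$2" -v pct="$3" \
    'BEGIN { exit !(total > 0 && avail * 100 / total < pct) }'
}

# Usage (not run here):
#   set -- $(fs_available_and_total /var/lib/docker)
#   if is_below_threshold "$1" "$2" 15; then
#     echo "imagefs.available<15% would be met"
#   fi
```

With a 1% threshold, as in the configuration above, the same disk would have to be almost completely full before the kubelet reports DiskPressure.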