Docker healthcheck stops working after a while
I am running Docker on a Raspberry Pi 3 Model B Plus Rev 1.3 with Raspberry Pi OS. All packages are up to date.
TL;DR
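The health checks for a given container work fine for a while (around 30 minutes, sometimes less, sometimes more), but at some point they get "stuck", so the container stays healthy even when that is no longer the case.
Is there a way to debug the health checks so I can figure out what is going on?
The healthcheck is not configured in the Dockerfile but in the yml file I use to deploy the stack, like this: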
healthcheck:
  test: ["CMD-SHELL", "curl -f -s -o /dev/null https://my.domain.com/icon/none.png || exit 1"]
  start_period: 1m
  interval: 5s
  timeout: 2s
  retries: 3
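As a rough guide to what these settings should produce, here is a small sketch of the arithmetic. It assumes Docker's documented behaviour: each check starts `interval` after the previous one finishes, a check that runs longer than `timeout` counts as failed, and `retries` consecutive failures mark the container unhealthy once the start period is over. The variable names are only illustrative.

```shell
#!/usr/bin/env bash
# Values copied from the healthcheck block above, in seconds.
interval=5
timeout=2
retries=3

# Each failing check may run for up to `timeout`, and the next one starts
# `interval` after the previous one ends, so a rough upper bound is:
worst_case=$(( retries * (interval + timeout) ))
echo "unhealthy after at most ~${worst_case}s"
```

With these values a broken tunnel should be flagged within roughly 21 seconds, so a silence of many minutes in the health log points at the checks not being run at all, not at slow detection.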
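When I start the container, I keep checking docker inspect
and I see a new health check every 5 seconds, as configured... but at some point they simply stop, and I don't know why, as shown below: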
pi@openhab:~ $ date
Thu Sep 30 01:45:46 UTC 2021
pi@openhab:~ $ docker inspect ebfa93c5e815
[
    {
        "Id": "ebfa93c5e815592879b6862b33a1a384cc43b60093f8df5c1a8d51ba25a7d0ef",
        "Created": "2021-09-30T00:36:17.319888926Z",
        "Path": "/entrypoint.sh",
        "Args": [],
        "State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 3743,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2021-09-30T00:36:24.648900024Z",
            "FinishedAt": "0001-01-01T00:00:00Z",
            "Health": {
                "Status": "healthy",
                "FailingStreak": 0,
                "Log": [
                    {
                        "Start": "2021-09-30T01:05:37.394601872Z",
                        "End": "2021-09-30T01:05:38.510395101Z",
                        "ExitCode": 0,
                        "Output": ""
                    },
                    {
                        "Start": "2021-09-30T01:05:43.538165679Z",
                        "End": "2021-09-30T01:05:44.701265903Z",
                        "ExitCode": 0,
                        "Output": ""
                    },
                    {
                        "Start": "2021-09-30T01:05:49.731086207Z",
                        "End": "2021-09-30T01:05:50.940299522Z",
                        "ExitCode": 0,
                        "Output": ""
                    },
                    {
                        "Start": "2021-09-30T01:05:55.971634397Z",
                        "End": "2021-09-30T01:05:57.222192641Z",
                        "ExitCode": 0,
                        "Output": ""
                    },
                    {
                        "Start": "2021-09-30T01:06:02.251407253Z",
                        "End": "2021-09-30T01:06:03.402660632Z",
                        "ExitCode": 0,
                        "Output": ""
                    }
                ]
            }
        },
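As you can see, the health checks ran fine for the first 30 minutes after the container started, and then they stopped. The current time is 40 minutes after the last health check.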
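The gap described above can be measured instead of read off by eye. The following is only a sketch, assuming GNU date (as shipped with Raspberry Pi OS); the function names `to_epoch`, `health_gap`, `is_stalled` and `last_check_end` are made up for illustration. `last_check_end` relies on the `json` helper of `docker inspect --format` to print each log entry's `End` value in the same form as the JSON above, and it is defined here but not called.

```shell
#!/usr/bin/env bash
# Sketch: report how long ago the last recorded health check ended.

# Convert an RFC 3339 UTC timestamp such as 2021-09-30T01:06:03.402660632Z
# to epoch seconds, dropping the fractional part.
to_epoch() {
  local ts="${1%Z}"   # remove the trailing Z
  ts="${ts%%.*}"      # remove fractional seconds, if any
  date -u -d "${ts}Z" +%s
}

# Print the gap in seconds between the last check's End and "now".
health_gap() {
  local last_end="$1" now="$2"
  echo $(( $(to_epoch "$now") - $(to_epoch "$last_end") ))
}

# Succeed (exit status 0) when the gap is larger than the allowed seconds.
is_stalled() {
  local gap
  gap=$(health_gap "$1" "$2")
  [ "$gap" -gt "$3" ]
}

# Assumed helper for a live system: End of the newest health log entry.
last_check_end() {
  docker inspect --format '{{range .State.Health.Log}}{{json .End}}{{"\n"}}{{end}}' "$1" \
    | tr -d '"' | grep . | tail -n 1
}
```

With the values from the output above, `health_gap "2021-09-30T01:06:03.402660632Z" "2021-09-30T01:45:46Z"` prints 2383, i.e. just under 40 minutes. On a live system something like `is_stalled "$(last_check_end ebfa93c5e815)" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" 21` could run from cron, where 21 seconds is three rounds of interval plus timeout with the settings shown earlier.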
Versions
$ docker version
Client:
Version: 18.09.1
API version: 1.39
Go version: go1.11.6
Git commit: 4c52b90
Built: Fri, 13 Sep 2019 10:45:43 +0100
OS/Arch: linux/arm
Experimental: false
Server:
Engine:
Version: 18.09.1
API version: 1.39 (minimum version 1.12)
Go version: go1.11.6
Git commit: 4c52b90
Built: Fri Sep 13 09:45:43 2019
OS/Arch: linux/arm
Experimental: false
pi@openhab:~ $ docker info
Containers: 41
Running: 6
Paused: 0
Stopped: 35
Images: 51
Server Version: 18.09.1
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
NodeID: jze7gn1w7y5fuk9ykv9omvuwh
Is Manager: true
ClusterID: 0zmswkmc5o699wichuas93j83
Managers: 1
Nodes: 1
Default Address Pool: 10.0.0.0/8
SubnetSize: 24
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 10
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 192.168.2.104
Manager Addresses:
192.168.2.104:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9754871865f7fe2f4e74d43e2fc7ccd237edcbce
runc version: 1.0.0~rc6+dfsg1-3
init version: v0.18.0 (expected: fec3683b971d9c3ef73f284f176672c44b448662)
Security Options:
seccomp
Profile: default
Kernel Version: 5.10.60-v7+
Operating System: Raspbian GNU/Linux 10 (buster)
OSType: linux
Architecture: armv7l
CPUs: 4
Total Memory: 923.2MiB
Name: openhab
ID: IL4N:6VFR:HOFK:7DL7:KMAS:PCNQ:7KOD:2JOM:R6I2:A5GD:HO7E:4CJQ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No memory limit support
WARNING: No swap limit support
WARNING: No kernel memory limit support
WARNING: No oom kill disable support
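What I am trying to do
I have openHAB running on a Raspberry Pi, and I want to be able to access it remotely.
The rPi is connected to a router, the router is connected to a modem, I don't have a static IP, and I'd rather not keep a hostname dynamically updated to point at my IP and then configure port forwarding in both the modem and the router, and so on... So instead, since I have a paid server with a static IP, I simply want to run SSH from the rPi to the remote server with a reverse port forward, so that I can reach openHAB through the remote server. I want this SSH connection to start automatically when the rPi boots, and to be restarted if for some reason I cannot reach some resource remotely (which is pretty much what the curl test in the healthcheck does).
I created a Docker image with the following Dockerfile: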
FROM alpine:3.11
RUN apk add --no-cache \
    curl \
    openssh-client \
    ca-certificates \
    bash
COPY known_hosts /known_hosts
COPY private_key /private_key
RUN chmod 0400 /private_key
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT [ "/entrypoint.sh" ]
and entrypoint.sh is simply:
#!/bin/bash
ssh -Nn user@my.domain.com -i /private_key -o UserKnownHostsFile=/known_hosts -R 127.0.0.1:17280:openhab:8080
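Now, this works very well while the health checks are running... I can restart the remote server and swarm restarts the ssh-client container... I can stop openHAB and swarm restarts the ssh-client... I can disconnect the rPi from the internet and swarm restarts the ssh-client... It all works exactly as I expect, until for some reason the health checks simply stop and the container stays "healthy" forever... I still have 60% of RAM and 62% of disk space free... Does anyone know what could be happening? Or have any suggestions? I can't find any logs either...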
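On the question of logs, the daemon's own journal is the most likely place to look. The sketch below assumes the daemon runs under systemd as `docker.service` (the usual Debian/Raspbian packaging) and that GNU date is available. `log_window_cmd` and its 300-second margin are made up for illustration. The function only prints a `journalctl` command for a window around a given timestamp, so nothing runs until you paste it.

```shell
#!/usr/bin/env bash
# Sketch: build a journalctl command covering a window around a timestamp.
# Assumes systemd, a unit named docker.service, and GNU date.

log_window_cmd() {
  local ts="${1%Z}" margin="${2:-300}"   # margin in seconds on each side
  ts="${ts%%.*}"                         # drop fractional seconds
  local t from to
  t=$(date -u -d "${ts}Z" +%s)
  from=$(date -u -d "@$(( t - margin ))" '+%Y-%m-%d %H:%M:%S UTC')
  to=$(date -u -d "@$(( t + margin ))" '+%Y-%m-%d %H:%M:%S UTC')
  echo "journalctl -u docker.service --since '$from' --until '$to'"
}

# Example: window around the End of the last health check in the log above.
log_window_cmd "2021-09-30T01:06:03.402660632Z"
```

The `docker info` output above reports `Debug Mode (server): false`, so the journal may say little. Setting the daemon's `debug` option in `/etc/docker/daemon.json` and restarting it raises the log verbosity, which may be worth trying before reproducing the stall.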
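This problem no longer seems to occur. I upgraded to Raspbian bullseye, and the health checks have now been running for a week straight without any problems.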
pi@openhab:~ $ docker version
Client:
Version: 20.10.5+dfsg1
API version: 1.41
Go version: go1.15.9
Git commit: 55c4c88
Built: Sat Dec 4 10:53:03 2021
OS/Arch: linux/arm
Context: default
Experimental: true
Server:
Engine:
Version: 20.10.5+dfsg1
API version: 1.41 (minimum version 1.12)
Go version: go1.15.9
Git commit: 363e9a8
Built: Sat Dec 4 10:53:03 2021
OS/Arch: linux/arm
Experimental: false
containerd:
Version: 1.4.13~ds1
GitCommit: 1.4.13~ds1-1~deb11u1
runc:
Version: 1.0.0~rc93+ds1
GitCommit: 1.0.0~rc93+ds1-5
docker-init:
Version: 0.19.0
GitCommit: