Docker healthcheck stops working after a while

I am running Docker on a Raspberry Pi 3 Model B Plus Rev 1.3 with Raspberry Pi OS. All packages are up to date.

TL;DR

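The health checks for a given container work fine for a while (about 30 minutes, sometimes less, sometimes more), but at some point they get "stuck", so the container stays marked as healthy even when it is not. Is there a way to debug the health checks and figure out what is going on?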

The health check is not configured in the Dockerfile but in the yml file I use to deploy the stack, as follows:

healthcheck:
  test: ["CMD-SHELL", "curl -f -s -o /dev/null https://my.domain.com/icon/none.png || exit 1"]
  start_period: 1m
  interval: 5s
  timeout: 2s
  retries: 3
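
One thing that may be worth ruling out is the probe command itself hanging. The curl call above has no time limit of its own and relies entirely on Docker's `timeout: 2s`. A minimal sketch of a probe that gives up by itself is shown here; the script name `/healthcheck.sh`, the `PROBE_URL` variable, the `probe` function and the chosen limits are assumptions for illustration, and nothing in this post confirms that a hanging curl is what makes the checks stop.

```shell
#!/bin/sh
# Hypothetical /healthcheck.sh (name and variable are assumptions).
# The URL is the one used in the healthcheck definition above.
PROBE_URL="${PROBE_URL:-https://my.domain.com/icon/none.png}"

probe() {
  # --connect-timeout bounds the TCP/TLS connection phase,
  # --max-time bounds the whole transfer, so curl exits on its own
  # instead of waiting for Docker to enforce the healthcheck timeout.
  curl -f -s -o /dev/null --connect-timeout 1 --max-time 2 "$1" || return 1
}
```

Calling `probe "$PROBE_URL"` as the last line of the script and pointing `test` at the script is one way to wire this up. Adding the two flags to the existing curl command in the yml file should have the same effect.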

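When I start the container, I keep checking docker inspect and I see a new health check every 5 seconds, as defined. At some point, though, they just stop and I have no idea why, as shown below: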

pi@openhab:~ $ date
Thu Sep 30 01:45:46 UTC 2021

pi@openhab:~ $ docker inspect ebfa93c5e815
[
    {
        "Id": "ebfa93c5e815592879b6862b33a1a384cc43b60093f8df5c1a8d51ba25a7d0ef",
        "Created": "2021-09-30T00:36:17.319888926Z",
        "Path": "/entrypoint.sh",
        "Args": [],
        "State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 3743,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2021-09-30T00:36:24.648900024Z",
            "FinishedAt": "0001-01-01T00:00:00Z",
            "Health": {
                "Status": "healthy",
                "FailingStreak": 0,
                "Log": [
                    {                                                                                                      
                        "Start": "2021-09-30T01:05:37.394601872Z",
                        "End": "2021-09-30T01:05:38.510395101Z",
                        "ExitCode": 0,  
                        "Output": ""
                    },                                         
                    {                    
                        "Start": "2021-09-30T01:05:43.538165679Z",
                        "End": "2021-09-30T01:05:44.701265903Z",
                        "ExitCode": 0,
                        "Output": ""
                    },               
                    {          
                        "Start": "2021-09-30T01:05:49.731086207Z",
                        "End": "2021-09-30T01:05:50.940299522Z",
                        "ExitCode": 0,
                        "Output": ""                                               
                    },         
                    {              
                        "Start": "2021-09-30T01:05:55.971634397Z",
                        "End": "2021-09-30T01:05:57.222192641Z",
                        "ExitCode": 0,
                        "Output": ""                                                             
                    },                
                    {                  
                        "Start": "2021-09-30T01:06:02.251407253Z",
                        "End": "2021-09-30T01:06:03.402660632Z",
                        "ExitCode": 0,
                        "Output": ""
                    }
                ]
            }
        },

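As can be seen, the health checks ran fine for the first 30 minutes after the container started and then stopped. The current time is 40 minutes after the last health check.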
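
The gap between the last probe and the current time can be computed instead of read off by eye, which makes it possible to detect a stalled health check from the host. The following is only a sketch: the function names are made up for illustration, it assumes GNU `date` (as on Raspberry Pi OS, not the BusyBox `date` inside the Alpine image), and the threshold of "interval times allowed missed probes" is an arbitrary choice.

```shell
#!/bin/sh
# Convert an RFC 3339 UTC timestamp such as the "End" fields in the
# docker inspect output to epoch seconds. Assumes GNU date.
to_epoch() {
  ts="${1%%.*}"   # drop fractional seconds, if present
  ts="${ts%Z}"    # drop a trailing Z left over when there was no fraction
  date -u -d "${ts}Z" +%s
}

# Seconds elapsed between the last health check ($1) and "now" ($2).
health_age_seconds() {
  last=$(to_epoch "$1")
  now=$(to_epoch "$2")
  echo $(( now - last ))
}

# Succeeds when the age ($1) exceeds interval ($2) * allowed missed probes ($3).
health_is_stale() {
  [ "$1" -gt $(( $2 * $3 )) ]
}

# Possible usage on the host (assumes jq is installed):
#   last=$(docker inspect --format '{{json .State.Health}}' ebfa93c5e815 | jq -r '.Log[-1].End')
#   age=$(health_age_seconds "$last" "$(date -u +%Y-%m-%dT%H:%M:%SZ)")
#   health_is_stale "$age" 5 3 && echo "health check stalled for ${age}s"
```

With the timestamps shown above (last `End` at 01:06:03 UTC, `date` at 01:45:46 UTC) the age is 2383 seconds, far beyond what a 5 second interval would allow.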

Versions

$ docker version
Client:
 Version:           18.09.1
 API version:       1.39
 Go version:        go1.11.6
 Git commit:        4c52b90
 Built:             Fri, 13 Sep 2019 10:45:43 +0100
 OS/Arch:           linux/arm
 Experimental:      false

Server:
 Engine:
  Version:          18.09.1
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.11.6
  Git commit:       4c52b90
  Built:            Fri Sep 13 09:45:43 2019
  OS/Arch:          linux/arm
  Experimental:     false
pi@openhab:~ $ docker info
Containers: 41
 Running: 6       
 Paused: 0                 
 Stopped: 35                                    
Images: 51   
Server Version: 18.09.1
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true                                     
Logging Driver: json-file       
Cgroup Driver: cgroupfs   
Plugins:                  
 Volume: local                       
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active       
 NodeID: jze7gn1w7y5fuk9ykv9omvuwh
 Is Manager: true          
 ClusterID: 0zmswkmc5o699wichuas93j83
 Managers: 1                    
 Nodes: 1                     
 Default Address Pool: 10.0.0.0/8      
 SubnetSize: 24                     
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 10
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 192.168.2.104
 Manager Addresses:
  192.168.2.104:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9754871865f7fe2f4e74d43e2fc7ccd237edcbce
runc version: 1.0.0~rc6+dfsg1-3
init version: v0.18.0 (expected: fec3683b971d9c3ef73f284f176672c44b448662)
Security Options:
 seccomp
  Profile: default
Kernel Version: 5.10.60-v7+
Operating System: Raspbian GNU/Linux 10 (buster)
OSType: linux
Architecture: armv7l
CPUs: 4
Total Memory: 923.2MiB
Name: openhab
ID: IL4N:6VFR:HOFK:7DL7:KMAS:PCNQ:7KOD:2JOM:R6I2:A5GD:HO7E:4CJQ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No memory limit support
WARNING: No swap limit support
WARNING: No kernel memory limit support
WARNING: No oom kill disable support

What I am trying to do

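I have openHAB running on a Raspberry Pi and I want to be able to access it remotely. The rPi is connected to a router, which is connected to a modem. I don't have a static IP, and I don't want to dynamically update a hostname to point to my IP and then configure port forwarding in the modem and the router, and so on. Instead, I have a paid server with a static IP, so I simply want to run SSH from the rPi to the remote server and set up a reverse port forward, so that I can reach openHAB through the remote server. I want this SSH connection to start automatically when the rPi boots, and to be restarted if for some reason I cannot reach some resource remotely (which is pretty much the curl test from the health check).

I created a Docker image with the following Dockerfile: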
FROM alpine:3.11
RUN apk add --no-cache \
  curl \
  openssh-client \
  ca-certificates \
  bash

COPY known_hosts /known_hosts
COPY private_key /private_key
RUN chmod 0400 /private_key
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT [ "/entrypoint.sh" ]

entrypoint.sh is simply:

#!/bin/bash
ssh -Nn user@my.domain.com -i /private_key -o UserKnownHostsFile=/known_hosts -R 127.0.0.1:17280:openhab:8080
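
As a complement to the health check, ssh itself can be asked to exit when the connection dies or the forward cannot be set up, so that swarm sees the container stop even if no health check fires. The sketch below is a variation on the entrypoint above, not what was actually deployed: the `start_tunnel` function name and the keep-alive values (15 seconds, 3 missed replies) are assumptions, while `ServerAliveInterval`, `ServerAliveCountMax` and `ExitOnForwardFailure` are standard OpenSSH client options.

```shell
#!/bin/bash
# Same tunnel as in entrypoint.sh, plus options that make ssh give up by itself.
start_tunnel() {
  # ServerAliveInterval/ServerAliveCountMax: send a keep-alive every 15 s and
  #   exit after 3 unanswered ones (values are illustrative).
  # ExitOnForwardFailure: exit if the remote port forward cannot be established.
  ssh -Nn "$1" \
    -i /private_key \
    -o UserKnownHostsFile=/known_hosts \
    -o ServerAliveInterval=15 \
    -o ServerAliveCountMax=3 \
    -o ExitOnForwardFailure=yes \
    -R 127.0.0.1:17280:openhab:8080
}
```

The entrypoint would then end with `start_tunnel user@my.domain.com`. This does not replace the curl-based check, since it only notices a broken SSH session and not an unreachable openHAB.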

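Now, this works really well while the health checks are running. I can restart the remote server and swarm restarts the ssh-client container. I can stop openhab and swarm restarts the ssh-client. I can disconnect the rPi from the internet and swarm restarts the ssh-client. All of this works the way I expect, until, for no apparent reason, the health checks simply stop and the container stays "healthy" forever. I still have 60% of RAM and 62% of disk space free. Does anyone know what could be happening, or have any suggestions? I couldn't find any logs either.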

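This problem no longer seems to occur. I upgraded to Raspbian bullseye, and the health checks have now been running continuously for a week without any issues.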

pi@openhab:~ $ docker version
Client:
 Version:           20.10.5+dfsg1
 API version:       1.41
 Go version:        go1.15.9
 Git commit:        55c4c88
 Built:             Sat Dec  4 10:53:03 2021
 OS/Arch:           linux/arm
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.5+dfsg1
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.15.9
  Git commit:       363e9a8
  Built:            Sat Dec  4 10:53:03 2021
  OS/Arch:          linux/arm
  Experimental:     false
 containerd:
  Version:          1.4.13~ds1
  GitCommit:        1.4.13~ds1-1~deb11u1
 runc:
  Version:          1.0.0~rc93+ds1
  GitCommit:        1.0.0~rc93+ds1-5
 docker-init:
  Version:          0.19.0
  GitCommit: