DNS lookups in a Docker container break after ~36 hours of uptime

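I have a container deployed via docker-compose on a host in a private network with no access to the internet. DNS inside the container goes through the Docker daemon's DNS server (127.0.0.11), and the DNS servers configured in /etc/resolv.conf belong to the private network.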

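The container runs fine for a while (roughly 40 hours), then its DNS lookups start failing with timeout messages. The application log shows the Docker DNS server failing: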

Caused by: java.net.UnknownHostException: failed to resolve 'alfresco.test.duf'
        at io.netty.resolver.dns.DnsResolveContext.finishResolve(DnsResolveContext.java:1013)
        at io.netty.resolver.dns.DnsResolveContext.tryToFinishResolve(DnsResolveContext.java:966)
        at io.netty.resolver.dns.DnsResolveContext.query(DnsResolveContext.java:414)
        at io.netty.resolver.dns.DnsResolveContext.access0(DnsResolveContext.java:63)
        at io.netty.resolver.dns.DnsResolveContext.operationComplete(DnsResolveContext.java:463)
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
        at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
        at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
        at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
        at io.netty.resolver.dns.DnsQueryContext.tryFailure(DnsQueryContext.java:225)
        at io.netty.resolver.dns.DnsQueryContext.run(DnsQueryContext.java:177)
        at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
        at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
        at io.netty.util.concurrent.SingleThreadEventExecutor.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:834)
    Caused by: io.netty.resolver.dns.DnsNameResolverTimeoutException: [/127.0.0.11:53] query via UDP timed out after 5000 milliseconds (no stack trace available)
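
The last line of the trace says that a UDP query to 127.0.0.11:53 got no reply within 5000 ms. To check the same thing outside of Netty, a minimal probe along these lines could be run inside the container. This is only a sketch: it assumes a Python interpreter is available there, and the names `build_dns_query` and `probe_dns` are made up for illustration.

```python
import socket
import struct


def build_dns_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Encode a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE = 1 (A), QCLASS = 1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def probe_dns(hostname, server="127.0.0.11", port=53, timeout=5.0):
    """Send one UDP query and report whether any reply arrived in time.

    Returns "answered", "timeout", or "error: <reason>". "answered" only
    means the server replied; the reply may still be NXDOMAIN or SERVFAIL.
    """
    query = build_dns_query(hostname)
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(512)
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Failures to create or use the socket itself end up here.
        return f"error: {exc}"
    # A reply carrying our transaction id means the resolver responded.
    if len(reply) >= 2 and reply[:2] == query[:2]:
        return "answered"
    return "error: unexpected reply"
```

Calling `probe_dns("alfresco.test.duf")` in both the failing container and a healthy one would show whether the failure is a silent timeout or a local socket error.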

The Docker daemon log shows the local network DNS server failing:

Aug 25 12:19:15 st2510v dockerd[6749]: time="2021-08-25T12:19:15.066556867+02:00" level=warning msg="[resolver] connect failed: dial udp 157.164.138.33:53: connect: resource temporarily unavailable"
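
One possible reading of this message: "resource temporarily unavailable" is the standard text for the EAGAIN error code on Linux, so the `connect` call may have failed for lack of a local resource rather than because 157.164.138.33 was unreachable. That interpretation is an assumption, not something the log states. The snippet below only shows how to look up the name and text of an error code; `describe_errno` is a hypothetical helper.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return the symbolic name and OS message for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# Compare this text with the tail of the dockerd warning.
print(describe_errno(errno.EAGAIN))
```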

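Pinging the target server from the Docker host resolves correctly.

Starting a bash container in the Docker network (the one created by compose) and pinging the target server from there also resolves correctly.

Pinging any server (external DNS, Docker DNS, the bash container) from inside the problematic container fails to resolve.

The container does not recover from the error on its own.

Restarting or recreating the container does fix the problem.

I have compared the host's iptables rules and network interfaces with those of a working instance that has no problems at all, but this did not reveal any significant differences.

Any suggestions as to what the problem is, or how to diagnose it?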

Update 1

Docker version output:

[al6735@st2510v ~]$ sudo docker version
Client: Docker Engine - Community
 Version:           19.03.5
 API version:       1.40
 Go version:        go1.12.12
 Git commit:        633a0ea
 Built:             Wed Nov 13 07:25:41 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.5
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.12
  Git commit:       633a0ea
  Built:            Wed Nov 13 07:24:18 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

Docker info output:

[al6735@st2510v ~]$ sudo docker info
Client:
 Debug Mode: false

Server:
 Containers: 2
  Running: 2
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 19.03.5
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 3.10.0-957.21.2.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.51GiB
 Name: st2510v
 ID: KTEE:M3ZD:5ZS5:DVFU:R6VJ:YV7Q:QPP5:D4YG:ITV7:YC3U:YP3J:AEDG
 Docker Root Dir: /home/docker
 Debug Mode: true
  File Descriptors: 38
  Goroutines: 48
  System Time: 2021-09-24T14:23:42.314595155+02:00
  EventsListeners: 0
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

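Further inspection of the host showed that the Java application in the target container was holding a large number of TCP sockets.

After fixing that, the connectivity problem no longer occurred. Presumably we were hitting the limit on the number of open sockets a container can have.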
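
A build-up of sockets like this can be spotted from the Docker host by counting what a process holds open under `/proc/<pid>/fd`. The following is a rough sketch, assuming a Linux host and enough privileges (for example root) to inspect the process. The function names are made up for illustration, and the count covers all sockets, not only TCP.

```python
import os
from collections import Counter


def classify_fd_target(target: str) -> str:
    """Map a /proc/<pid>/fd symlink target to a coarse category."""
    if target.startswith("socket:["):
        return "socket"
    if target.startswith("pipe:["):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"
    return "file"


def summarize_fds(targets):
    """Count descriptors per category from an iterable of symlink targets."""
    return Counter(classify_fd_target(t) for t in targets)


def read_fd_targets(pid):
    """Read the symlink targets under /proc/<pid>/fd (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return targets
```

The host PID of a container's main process can be obtained with `docker inspect --format '{{.State.Pid}}' <container>`; whether that PID is the Java process itself depends on how the image starts it. Running `summarize_fds(read_fd_targets(pid))` at intervals would show whether the socket count keeps growing over the hours before the failure.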