DNS lookup for Docker container breaks after ~36 hours of uptime
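I have a container deployed through docker-compose on a host (DNS is handled by the Docker daemon's DNS server at 127.0.0.11), with DNS servers for the private network configured in /etc/resolv.conf.
No internet access is available.
The container runs fine for a while (approximately 40 hours), then its DNS lookups start failing with timeout messages.
The application logs show the Docker DNS server failing: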
Caused by: java.net.UnknownHostException: failed to resolve 'alfresco.test.duf'
at io.netty.resolver.dns.DnsResolveContext.finishResolve(DnsResolveContext.java:1013)
at io.netty.resolver.dns.DnsResolveContext.tryToFinishResolve(DnsResolveContext.java:966)
at io.netty.resolver.dns.DnsResolveContext.query(DnsResolveContext.java:414)
at io.netty.resolver.dns.DnsResolveContext.access0(DnsResolveContext.java:63)
at io.netty.resolver.dns.DnsResolveContext.operationComplete(DnsResolveContext.java:463)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
at io.netty.resolver.dns.DnsQueryContext.tryFailure(DnsQueryContext.java:225)
at io.netty.resolver.dns.DnsQueryContext.run(DnsQueryContext.java:177)
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
at io.netty.util.concurrent.SingleThreadEventExecutor.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: io.netty.resolver.dns.DnsNameResolverTimeoutException: [/127.0.0.11:53] query via UDP timed out after 5000 milliseconds (no stack trace available)
The Docker daemon logs show the local network DNS server failing:
Aug 25 12:19:15 st2510v dockerd[6749]: time="2021-08-25T12:19:15.066556867+02:00" level=warning msg="[resolver] connect failed: dial udp 157.164.138.33:53: connect: resource temporarily unavailable"
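Pinging the target server from the Docker host resolves correctly.
Starting a bash container in the Docker network (created through Compose) and pinging the target server from there resolves correctly.
Pinging any server (external DNS, Docker DNS, bash container) from inside the affected container fails to resolve.
The container does not recover from the error by itself.
Restarting or recreating the container does resolve the problem.
I have compared the host iptables and network interfaces with a working instance that has no issues at all, but this did not reveal any significant differences.
Any suggestions on what the problem could be or how to diagnose it?
Update 1
Docker version output: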
[al6735@st2510v ~]$ sudo docker version
Client: Docker Engine - Community
Version: 19.03.5
API version: 1.40
Go version: go1.12.12
Git commit: 633a0ea
Built: Wed Nov 13 07:25:41 2019
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 19.03.5
API version: 1.40 (minimum version 1.12)
Go version: go1.12.12
Git commit: 633a0ea
Built: Wed Nov 13 07:24:18 2019
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.2.13
GitCommit: 7ad184331fa3e55e52b890ea95e65ba581ae3429
runc:
Version: 1.0.0-rc10
GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
docker-init:
Version: 0.18.0
GitCommit: fec3683
Docker info output:
[al6735@st2510v ~]$ sudo docker info
Client:
Debug Mode: false
Server:
Containers: 2
Running: 2
Paused: 0
Stopped: 0
Images: 3
Server Version: 19.03.5
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-957.21.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.51GiB
Name: st2510v
ID: KTEE:M3ZD:5ZS5:DVFU:R6VJ:YV7Q:QPP5:D4YG:ITV7:YC3U:YP3J:AEDG
Docker Root Dir: /home/docker
Debug Mode: true
File Descriptors: 38
Goroutines: 48
System Time: 2021-09-24T14:23:42.314595155+02:00
EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
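Further inspection of the host showed that the Java application in the target container was holding a large number of TCP sockets.
After fixing that, the connectivity issues no longer appeared. Presumably we had hit a limit on the number of open sockets a container can have.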
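For anyone trying to confirm a similar socket build-up, here is a minimal sketch in Python, not the exact check used above. It assumes a Linux host where the container's main process ID is known (for example from `docker inspect -f '{{.State.Pid}}' <container>`) and reads `/proc/<pid>/net/tcp`, which lists the sockets in that process's network namespace. The function names `count_tcp_states` and `report` are made up for illustration.

```python
from collections import Counter

# TCP state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED",
    "02": "SYN_SENT",
    "03": "SYN_RECV",
    "04": "FIN_WAIT1",
    "05": "FIN_WAIT2",
    "06": "TIME_WAIT",
    "07": "CLOSE",
    "08": "CLOSE_WAIT",
    "09": "LAST_ACK",
    "0A": "LISTEN",
    "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of a /proc/<pid>/net/tcp file."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated rows
        code = fields[3].upper()  # fourth column is the hex state code
        counts[TCP_STATES.get(code, code)] += 1
    return counts


def report(pid):
    """Print per-state socket counts for the network namespace of `pid`."""
    for name in ("tcp", "tcp6"):
        try:
            with open(f"/proc/{pid}/net/{name}") as fh:
                counts = count_tcp_states(fh.read())
        except FileNotFoundError:
            continue  # e.g. IPv6 disabled, or the process has exited
        print(name, sum(counts.values()), dict(counts))
```

Calling `report(<host PID of the container process>)` periodically (root may be required) shows whether the total keeps climbing. A count in a single state that only ever grows, for example CLOSE_WAIT, would be consistent with the application leaking connections, though which state dominated in our case is not something recorded here.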