Java DNS resolution hangs forever
I am using the Curator framework to connect to a ZooKeeper server, but I am running into a strange DNS resolution problem. Here is the jstack dump for the thread:
#21 prio=5 os_prio=0 tid=0x0000000001888800 nid=0x3a46 runnable [0x00007f25e69f3000]
java.lang.Thread.State: RUNNABLE
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress.lookupAllHostAddr(InetAddress.java:928)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at org.apache.zookeeper.client.StaticHostProvider.resolveAndShuffle(StaticHostProvider.java:117)
at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:81)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:1096)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:1006)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:804)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:679)
at com.netflix.curator.HandleHolder.getZooKeeper(HandleHolder.java:72)
- locked <0x00000000fd761f40> (a com.netflix.curator.HandleHolder)
at com.netflix.curator.HandleHolder.getZooKeeper(HandleHolder.java:46)
at com.netflix.curator.ConnectionState.reset(ConnectionState.java:122)
at com.netflix.curator.ConnectionState.start(ConnectionState.java:95)
at com.netflix.curator.CuratorZookeeperClient.start(CuratorZookeeperClient.java:137)
at com.netflix.curator.framework.imps.CuratorFrameworkImpl.start(CuratorFrameworkImpl.java:167)
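The thread appears to be stuck in the native method and never returns. It also happens very randomly, so I have not been able to reproduce it. Any ideas?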
We are also trying to solve this problem. It looks like it is caused by either a glibc bug: https://bugzilla.kernel.org/show_bug.cgi?id=99671 or a kernel bug: https://bugzilla.redhat.com/show_bug.cgi?id=1209433, depending on who you ask ;)
Also worth reading: https://access.redhat.com/security/cve/cve-2013-7423 and https://alas.aws.amazon.com/ALAS-2015-617.html
To confirm that this is indeed the case, attach gdb to the Java process:
gdb --pid <JavaProcessPid>
Then, from within gdb:
info threads
Find the thread that is executing recvmsg and switch to it:
thread <HangingThreadId>
and then:
backtrace
If you see something like this, then you know that a glibc/kernel upgrade will help:
#0 0x00007fc726ff27cd in recvmsg () from /lib64/libc.so.6
#1 0x00007fc727018765 in make_request () from /lib64/libc.so.6
#2 0x00007fc727018b9a in __check_pf () from /lib64/libc.so.6
#3 0x00007fc726fdbd57 in getaddrinfo () from /lib64/libc.so.6
#4 0x00007fc706dd9635 in Java_java_net_Inet6AddressImpl_lookupAllHostAddr () from /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.65-0.b17.el6_7.x86_64/jre/lib/amd64/libnet.so
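When many hosts need checking, the comparison against the backtrace above can be scripted. The following is only a minimal sketch: the class and method names (`CheckPfBacktrace`, `frameFunctions`, `looksLikeCheckPfHang`) are hypothetical, and it assumes gdb prints frames in the `#N 0x... in function () from ...` form shown above. It reports a match only when the four innermost frames are recvmsg, make_request, __check_pf and getaddrinfo, in that order.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of gdb, glibc or the JDK.
class CheckPfBacktrace {
    // Innermost frames first, copied from the backtrace quoted in the answer.
    private static final List<String> EXPECTED =
            Arrays.asList("recvmsg", "make_request", "__check_pf", "getaddrinfo");

    // Matches gdb frame lines such as "#0 0x00007f... in recvmsg () from ...".
    // The address part is optional because gdb may omit it for some frames.
    private static final Pattern FRAME =
            Pattern.compile("^#\\d+\\s+(?:0x[0-9a-fA-F]+\\s+in\\s+)?(\\S+)\\s*\\(");

    // Extracts the function name of every frame line, in the order printed.
    static List<String> frameFunctions(String backtrace) {
        List<String> names = new ArrayList<>();
        for (String line : backtrace.split("\\R")) {
            Matcher m = FRAME.matcher(line.trim());
            if (m.find()) {
                names.add(m.group(1));
            }
        }
        return names;
    }

    // True when the innermost frames equal the expected sequence.
    static boolean looksLikeCheckPfHang(String backtrace) {
        List<String> names = frameFunctions(backtrace);
        return names.size() >= EXPECTED.size()
                && names.subList(0, EXPECTED.size()).equals(EXPECTED);
    }

    // Reads a backtrace from standard input and prints the verdict.
    public static void main(String[] args) {
        String text = new BufferedReader(new InputStreamReader(System.in))
                .lines().collect(Collectors.joining("\n"));
        System.out.println(looksLikeCheckPfHang(text)
                ? "backtrace matches the check_pf hang"
                : "backtrace does not match");
    }
}
```

Paste the output of `backtrace` for the hanging thread into the program's standard input. A match only says that the stack has the same shape as the one above; it does not prove which bug is responsible.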
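Update: it looks like the kernel won. See this thread for details: http://www.gossamer-threads.com/lists/linux/kernel/2264958
There is also a tool to verify whether your system is affected by the kernel bug. You can use this simple program: https://gist.github.com/stevenschlansker/6ad46c5ccb22bc4f3473
To verify: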
curl -o pf_dump.c https://gist.githubusercontent.com/stevenschlansker/6ad46c5ccb22bc4f3473/raw/22cfe72f6708de1e3468c1e0fa3888aafae42db4/pf_dump.c
gcc pf_dump.c -pthread -o pf_dump
./pf_dump
If the output is:
[26170] glibc: check_pf: netlink socket read timeout
Aborted
then the system is affected. If the output is something like:
exit success
[7618] exit success
[7265] exit success
then the system is OK.
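The two outcomes described above can also be told apart programmatically, for example when checking a fleet of machines. This is a rough sketch under stated assumptions: the class `PfDumpVerdict` and its `classify` method are hypothetical names, and the only strings it relies on are the two sample outputs quoted above. Any other output is reported as unknown and not guessed at.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.stream.Collectors;

// Hypothetical wrapper around the pf_dump test program; not part of the gist.
class PfDumpVerdict {
    enum Verdict { AFFECTED, OK, UNKNOWN }

    // Classifies pf_dump output using only the sample messages from the answer.
    static Verdict classify(String output) {
        if (output.contains("check_pf: netlink socket read timeout")) {
            return Verdict.AFFECTED;   // the timeout message wins over anything else
        }
        if (output.contains("exit success")) {
            return Verdict.OK;
        }
        return Verdict.UNKNOWN;        // unrecognised output: make no claim
    }

    // Usage: java PfDumpVerdict ./pf_dump
    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java PfDumpVerdict <path-to-pf_dump>");
            return;
        }
        // Merge stderr into stdout, since the glibc message may be written to stderr.
        Process p = new ProcessBuilder(args[0]).redirectErrorStream(true).start();
        String out;
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            out = r.lines().collect(Collectors.joining("\n"));
        }
        p.waitFor();
        System.out.println(classify(out));
    }
}
```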
On AWS, upgrading the AMI to 2016.3.2, which comes with the new kernel, seems to have fixed the problem.
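Where the kernel cannot be upgraded right away, application code that performs its own lookups can at least avoid blocking forever. The sketch below is not something the bug reports above propose. It is an illustrative workaround, `BoundedResolver` is a hypothetical name, and it assumes that giving up after a timeout is acceptable. It does not free the stuck thread, because a thread blocked inside the native getaddrinfo call does not respond to interruption. It also does not change lookups that a library performs internally, such as the one inside the ZooKeeper constructor in the stack trace above.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs the lookup on a separate thread and waits a bounded time.
class BoundedResolver {
    // Daemon threads, so a lookup stuck in native code cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "bounded-dns-lookup");
        t.setDaemon(true);
        return t;
    });

    static InetAddress[] resolve(String host, long timeout, TimeUnit unit)
            throws TimeoutException, UnknownHostException, InterruptedException {
        Callable<InetAddress[]> task = () -> InetAddress.getAllByName(host);
        Future<InetAddress[]> future = POOL.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Best effort only: the worker may remain blocked in the native call.
            future.cancel(true);
            throw e;
        } catch (ExecutionException e) {
            if (e.getCause() instanceof UnknownHostException) {
                throw (UnknownHostException) e.getCause();
            }
            throw new IllegalStateException(e.getCause());
        }
    }

    // Usage: java BoundedResolver <hostname>
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        for (InetAddress a : resolve(host, 5, TimeUnit.SECONDS)) {
            System.out.println(a.getHostAddress());
        }
    }
}
```

Each lookup that times out leaves one worker thread behind, so this limits the damage but does not fix it. The kernel upgrade remains the actual remedy.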