YARN complains java.net.NoRouteToHostException: No route to host (Host unreachable)

Trying to run H2O on an HDP 3.1 cluster and running into errors that appear to be related to YARN resource capacity...

[ml1user@HW04 h2o-3.26.0.1-hdp3.1]$ hadoop jar h2odriver.jar -nodes 3 -mapperXmx 10g
Determining driver host interface for mapper->driver callback...
    [Possible callback IP address: 192.168.122.1]
    [Possible callback IP address: 172.18.4.49]
    [Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 172.18.4.49:46015
(You can override these with -driverif and -driverport/-driverportrange and/or specify external IP using -extdriverif.)
Memory Settings:
    mapreduce.map.java.opts:     -Xms10g -Xmx10g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.defaultInitOverride=true
    Extra memory percent:        10
    mapreduce.map.memory.mb:     11264
Hive driver not present, not generating token.
19/07/25 14:48:05 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/07/25 14:48:06 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200
19/07/25 14:48:07 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/ml1user/.staging/job_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: number of splits:3
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: Executing with tokens: []
19/07/25 14:48:08 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.1.0.0-78/0/resource-types.xml
19/07/25 14:48:08 INFO impl.YarnClientImpl: Submitted application application_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.Job: The url to track the job: http://HW01.ucera.local:8088/proxy/application_1564020515809_0006/
Job name 'H2O_47159' submitted
JobTracker job ID is 'job_1564020515809_0006'
For YARN users, logs command is 'yarn logs -applicationId application_1564020515809_0006'
Waiting for H2O cluster to come up...
ERROR: Timed out waiting for H2O cluster to come up (120 seconds)
ERROR: (Try specifying the -timeout option to increase the waiting time limit)
Attempting to clean up hadoop job...
19/07/25 14:50:19 INFO impl.YarnClientImpl: Killed application application_1564020515809_0006
Killed.
19/07/25 14:50:23 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/07/25 14:50:23 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200

----- YARN cluster metrics -----
Number of YARN worker nodes: 3

----- Nodes -----
Node: http://HW03.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://HW04.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://HW02.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used

----- Queues -----
Queue name:            default
    Queue state:       RUNNING
    Current capacity:  0.00
    Capacity:          1.00
    Maximum capacity:  1.00
    Application count: 0

Queue 'default' approximate utilization: 0.0 / 45.0 GB used, 0 / 9 vcores used

----------------------------------------------------------------------

ERROR: Unable to start any H2O nodes; please contact your YARN administrator.

       A common cause for this is the requested container size (11.0 GB)
       exceeds the following YARN settings:

           yarn.nodemanager.resource.memory-mb
           yarn.scheduler.maximum-allocation-mb

----------------------------------------------------------------------

For YARN users, logs command is 'yarn logs -applicationId application_1564020515809_0006'

Looking through the YARN configs in the Ambari UI, I could not find these properties. However, checking the YARN logs UI in the YARN ResourceManager and looking at the logs of some of the killed applications, I saw what appear to be host-unreachable errors...
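As an aside, even when these properties are not surfaced in the Ambari UI, they can be read straight out of the generated config files on any node. A minimal sketch, assuming the standard HDP config location /etc/hadoop/conf:

# print each memory property and the <value> line that follows it
grep -A1 -E 'yarn.nodemanager.resource.memory-mb|yarn.scheduler.maximum-allocation-mb' /etc/hadoop/conf/yarn-site.xml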

Container: container_e05_1564020515809_0006_02_000002 on HW03.ucera.local_45454_1564102219781
LogAggregationType: AGGREGATED
=============================================================================================
LogType:stderr
LogLastModifiedTime:Thu Jul 25 14:50:19 -1000 2019
LogLength:2203
LogContents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/yarn/local/filecache/11/mapreduce.tar.gz/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/yarn/local/usercache/ml1user/appcache/application_1564020515809_0006/filecache/10/job.jar/job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapred.YarnChild).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
java.net.NoRouteToHostException: No route to host (Host unreachable)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    ....
    at java.net.Socket.<init>(Socket.java:211)
    at water.hadoop.EmbeddedH2OConfig$BackgroundWriterThread.run(EmbeddedH2OConfig.java:38)

End of LogType:stderr
***********************************************************************

Notice the "java.net.NoRouteToHostException: No route to host (Host unreachable)". However, I can access all the other nodes from each one, and they can all ping each other, so I'm not sure what is going on here. Any suggestions for debugging or fixing this?
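One check worth doing in a situation like this: a successful ping only proves ICMP gets through, and a host firewall can still drop TCP connections. A quick sketch for probing the driver's callback port directly from a worker node (the address and port 46015 are taken from the h2odriver output above; substitute your own):

# test TCP reachability of the driver's callback port from a worker node
nc -zv 172.18.4.49 46015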

I think I found the problem. TL;DR: firewalld (the nodes run CentOS 7) was still running, when it should be disabled on HDP clusters.
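A quick way to confirm whether firewalld is up on a given node (a sketch, assuming systemd on CentOS 7):

# check whether firewalld is currently running on this node
systemctl is-active firewalld
firewall-cmd --state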

From another community post:

For Ambari to communicate during setup with the hosts it deploys to and manages, certain ports must be open and available. The easiest way to do this is to temporarily disable iptables, as follows:

systemctl disable firewalld

service firewalld stop

Apparently iptables and firewalld need to be disabled across the entire cluster (supporting docs can be found here; I had only disabled them on the Ambari installation node). After stopping these services across the cluster (I recommend using clush, as sketched below), I was able to run YARN jobs without incident.
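For completeness, a sketch of doing this on every node at once with clush (this assumes ClusterShell is installed and configured with your cluster's hosts; -a targets all configured nodes, -b merges identical output):

# stop and disable firewalld on every node in one shot
clush -a -b 'systemctl stop firewalld && systemctl disable firewalld'
# verify: every node should now report "inactive"
clush -a -b 'systemctl is-active firewalld'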

Generally, this problem is caused by a bad DNS configuration, firewalls, or network unreachability. Quoting this official documentation:

  • The hostname of the remote machine is wrong in the configuration files
  • The client's host table /etc/hosts has an invalid IPAddress for the target host.
  • The DNS server's host table has an invalid IPAddress for the target host.
  • The client's routing tables (In Linux, iptables) are wrong.
  • The DHCP server is publishing bad routing information.
  • Client and server are on different subnets, and are not set up to talk to each other. This may be an accident, or it is to deliberately lock down the Hadoop cluster.
  • The machines are trying to communicate using IPv6. Hadoop does not currently support IPv6
  • The host's IP address has changed but a long-lived JVM is caching the old value. This is a known problem with JVMs (search for "java negative DNS caching" for the details and solutions). The quick solution: restart the JVMs
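Several of these causes can be ruled out quickly from a shell on the client node; a minimal diagnostic sketch (the hostname and ResourceManager address are taken from the logs above; substitute your own):

# does the remote hostname resolve to the address you expect (/etc/hosts, then DNS)?
getent hosts hw01.ucera.local
# which IP does this host resolve its own name to?
hostname -i
# which interface/gateway would traffic to the ResourceManager actually use?
ip route get 172.18.4.46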

For me, the problem was that the driver was running inside a Docker container, which made it impossible for the workers to send data back to it. In other words, the workers and the driver were not on the same subnet. The solution given there was to set the following configuration:

spark.driver.host=<container's host IP accessible by the workers>
spark.driver.bindAddress=0.0.0.0
spark.driver.port=<forwarded port 1>
spark.driver.blockManager.port=<forwarded port 2>
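Putting it together, a sketch of what the submission might look like if the container's host were reachable by the workers at 172.18.4.49 and ports 40000/40001 were published from the container (the image name, app file, address, and port numbers are all illustrative assumptions):

# publish the two driver ports when starting the container, then submit with
# the driver advertising the host's address while binding to all interfaces
docker run -p 40000:40000 -p 40001:40001 my-spark-client \
    spark-submit \
        --conf spark.driver.host=172.18.4.49 \
        --conf spark.driver.bindAddress=0.0.0.0 \
        --conf spark.driver.port=40000 \
        --conf spark.driver.blockManager.port=40001 \
        my_app.py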