Ambari Hadoop Spark Cluster Firewall Issues

I just stood up a Hadoop/Spark cluster in an effort to kick off my company's data science initiative. I'm using Ambari as the manager and installed the Hortonworks distribution (HDFS 2.7.3, Hive 1.2.1, Spark 2.1.1, and the other required services). By the way, I'm running RHEL 7. I have 2 name nodes, 10 data nodes, 1 Hive node, and 1 management node (Ambari).

I built a firewall port list based on the Apache and Ambari documentation and had my infrastructure guys push those rules out. I ran into an issue with Spark wanting to pick random ports: when I attempted to run a Spark job (the traditional Pi example), it failed because I didn't have the entire ephemeral port range open. Since we will likely be running multiple jobs, it made sense to let Spark handle this and just pick from the ephemeral port range (1024 - 65535) rather than specifying individual ports. I know I could pick a smaller range, but to keep it simple I just asked my guys to open the whole ephemeral range. At first they balked at that, but once I explained the purpose they went along with it.
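For reference, on RHEL 7 with firewalld, opening that range looks roughly like the following (a minimal sketch; the default zone and TCP-only assumption are mine, not necessarily how my infrastructure team actually pushed the rules):

# open the ephemeral range for TCP and reload the rules
firewall-cmd --permanent --add-port=1024-65535/tcp
firewall-cmd --reload
# confirm the range shows up
firewall-cmd --list-ports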

Based on this, I figured my issue was resolved, but when I attempted to run a job it still failed:

Log Type: stderr 

Log Upload Time: Thu Oct 12 11:31:01 -0700 2017 

Log Length: 14617 

Showing 4096 bytes of 14617 total. Click here for the full log. 

Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:52 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:53 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:54 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:55 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:56 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:57 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:57 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:59 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:00 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:01 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:02 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:03 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:04 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:05 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:06 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:06 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:07 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:09 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:10 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:11 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:12 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:13 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:14 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:15 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:15 ERROR ApplicationMaster: Uncaught exception: 
org.apache.spark.SparkException: Failed to connect to driver!
    at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:607)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:461)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:283)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main.apply$mcV$sp(ApplicationMaster.scala:783)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon.run(SparkHadoopUtil.scala:67)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon.run(SparkHadoopUtil.scala:66)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:781)
    at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:804)
    at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
17/10/12 11:29:15 INFO ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: org.apache.spark.SparkException: Failed to connect to driver!)
17/10/12 11:29:15 INFO ShutdownHookManager: Shutdown hook called

At first I thought there might be some kind of misconfiguration between Spark and the namenodes/datanodes. To test that, though, I simply stopped firewalld on every node and tried the job again, and it worked fine.

So, my question: I have the entire 1024 - 65535 port range open, and I can see the Spark driver trying to connect on those high ports (as shown above, in the 30k - 40k range). Yet for some reason it fails when the firewall is up and works when it's down. I checked my firewall rules and, sure enough, the ports are open, and the rules are clearly being applied, because I can reach the web UIs for Ambari, YARN, and HDFS that are specified in the same firewalld XML rules file...

I'm new to Hadoop/Spark, so I'm wondering if I'm missing something. Are there some lower ports below 1024 I need to account for? Here is the list of ports below 1024 that I have open, in addition to the 1024 - 65535 range:

88
111
443
1004
1006
1019

It's quite possible I'm missing a lower-numbered port that I actually need and just don't know about. Beyond that, everything else should be covered by the 1024 - 65535 port range.

Resolution

OK, working with some folks over on the Hortonworks community, I was able to come up with a solution. Basically, you do need to define at least one port, but you can extend that into a range by specifying spark.port.maxRetries = xxxxx. Combined with spark.driver.port = xxxxx, this gives you a range starting at spark.driver.port and ending at spark.driver.port + spark.port.maxRetries.
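For a quick per-job sanity check, the same settings can also be passed on the command line instead of cluster-wide; something like this (the port values and the examples-jar path are placeholders, adjust for your install) re-runs the Pi example with the driver pinned to the 40000 - 40032 window:

spark-submit --master yarn \
  --conf spark.driver.port=40000 \
  --conf spark.port.maxRetries=32 \
  --class org.apache.spark.examples.SparkPi \
  /usr/hdp/current/spark2-client/examples/jars/spark-examples*.jar 100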

If you're using Ambari as the manager, the settings live under the "Custom spark2-defaults" section (I assume that on a fully open-source stack install these would just be ordinary entries in the normal Spark config):
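The entries themselves are just key/value pairs; mine look roughly like this (the actual port numbers are only an example, pick whatever window your firewall team is comfortable with):

# pin the driver's starting port and let Spark walk upward on conflicts
spark.driver.port      40000
spark.port.maxRetries  32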

I was advised to separate these ports in blocks of 32, e.g. if your driver starts at 40000, spark.blockManager.port should start at 40033, and so on; a sketch of that layout follows the link below. See this post at:

https://community.hortonworks.com/questions/141484/spark-jobs-failing-firewall-issue.html
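To make the spacing concrete, the pattern looks something like this (illustrative numbers, not values taken from the linked thread): with spark.port.maxRetries = 32, the driver occupies roughly 40000 - 40032 and the block manager roughly 40033 - 40065, so each service gets its own small window.

spark.driver.port        40000
spark.blockManager.port  40033
spark.port.maxRetries    32

# the matching firewall rule then only needs that window, not the whole ephemeral range
firewall-cmd --permanent --add-port=40000-40065/tcp
firewall-cmd --reload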