Active master not accepting new applications if one of the masters added to zookeeper is down
I am facing a very strange issue while enabling high availability (HA) in a Spark standalone cluster.
I have configured 3 Spark masters and registered them with ZooKeeper using the following steps:
- Created a properties file ha.conf with the following content (see the note after these steps about using a full ZooKeeper quorum):
spark.deploy.recoveryMode=ZOOKEEPER
spark.deploy.zookeeper.url=ZK_HOST:2181
spark.deploy.zookeeper.dir=/spark
- Started all 3 masters by passing this properties file as an argument to the start-master script, like this (the commands for the other two masters are sketched after these steps):
./start-master.sh -h localhost -p 17077 --webui-port 18080 --properties-file ha.conf
With this, all 3 Spark masters were started and registered in ZooKeeper.
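As a side note, spark.deploy.zookeeper.url also accepts a comma-separated ZooKeeper quorum, which is what a production setup would normally use (the zk1-zk3 hostnames here are placeholders):
spark.deploy.recoveryMode=ZOOKEEPER
spark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181
spark.deploy.zookeeper.dir=/spark
And for completeness, the other two masters were started the same way on their own hosts; the hostnames and RPC ports below are taken from the master URL used with spark-submit later, while the web UI ports are assumed:
./start-master.sh -h h2 -p 27077 --webui-port 28080 --properties-file ha.conf
./start-master.sh -h h3 -p 37077 --webui-port 38080 --properties-file ha.conf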
Working:
If I kill the active master, all running applications are picked up by the new active master.
Not working:
If any one of the Spark masters (e.g. localhost:17077) is down/not working and I submit an application using the following command:
./bin/spark-submit --class WordCount --master spark://localhost:17077,h2:27077,h3:37077 --deploy-mode cluster --conf spark.cores.max=1 ~/TestSpark-0.0.1-SNAPSHOT.jar /user1/test.txt
Ideally the submission should go to the active master and work fine, since only one master is down and the others are up, but I get an exception:
Exception in thread "main" org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun.applyOrElse(RpcTimeout.scala:77)
at org.apache.spark.rpc.RpcTimeout$$anonfun.applyOrElse(RpcTimeout.scala:75)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout.applyOrElse(RpcTimeout.scala:59)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
at org.apache.spark.deploy.Client$$anonfun.apply(Client.scala:230)
at org.apache.spark.deploy.Client$$anonfun.apply(Client.scala:230)
at scala.collection.TraversableLike$$anonfun$map.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.deploy.Client$.main(Client.scala:230)
at org.apache.spark.deploy.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
at org.apache.spark.deploy.SparkSubmit$.doRunMain(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: Failed to connect to localhost/127.0.0.1:17077
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
at org.apache.spark.rpc.netty.Outbox$$anon.call(Outbox.scala:191)
at org.apache.spark.rpc.netty.Outbox$$anon.call(Outbox.scala:187)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused: localhost/127.0.0.1:17077
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor.run(SingleThreadEventExecutor.java:111)
... 1 more
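A quick way to confirm which masters are actually reachable (this sketch assumes nc is installed; it just probes each RPC port):
for m in localhost:17077 h2:27077 h3:37077; do
  nc -z "${m%:*}" "${m#*:}" && echo "$m is up" || echo "$m is down"
done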
Any help/clue/suggestion is appreciated. Please help me understand this; I have searched for similar issues but could not find anything.
Update
I face this issue only when submitting the application in cluster mode; there is no issue when submitting in client mode. (The stack trace above points at org.apache.spark.deploy.Client mapping over all master URLs and failing as soon as one endpoint ref cannot be set up, which would explain why only cluster mode is affected.)
Instead of submitting the application to the legacy submission endpoint running on 7077, it can be submitted to the Spark REST server running on 6066.
So the issue was resolved by submitting the application to the REST server using the following command:
./bin/spark-submit --class WordCount --master spark://localhost:6066,h2:6066,h3:6066 --deploy-mode cluster --conf spark.cores.max=1 ~/TestSpark-0.0.1-SNAPSHOT.jar /user1/test.txt
Now if one Spark master goes down, the application gets submitted via another Spark master.
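For reference, spark-submit against port 6066 uses Spark's (officially undocumented) standalone REST submission protocol, so the same submission can also be made directly with curl. This is only a sketch: the jar path, Spark version, and target master host are assumptions to replace with your own, and the request must go to a master whose REST server is up:
curl -X POST http://h2:6066/v1/submissions/create \
  --header "Content-Type: application/json" \
  --data '{
    "action": "CreateSubmissionRequest",
    "appResource": "file:/home/user1/TestSpark-0.0.1-SNAPSHOT.jar",
    "mainClass": "WordCount",
    "appArgs": ["/user1/test.txt"],
    "clientSparkVersion": "2.1.0",
    "environmentVariables": {"SPARK_ENV_LOADED": "1"},
    "sparkProperties": {
      "spark.app.name": "WordCount",
      "spark.master": "spark://h2:6066",
      "spark.jars": "file:/home/user1/TestSpark-0.0.1-SNAPSHOT.jar",
      "spark.submit.deployMode": "cluster",
      "spark.driver.supervise": "false",
      "spark.cores.max": "1"
    }
  }'
Unlike this raw curl call against a single host, the multi-master URL given to spark-submit lets the client try each REST endpoint in turn, which is why the failover works.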