无法检索领导者网关和端口

Failed to retrieve leader gateway and port

我正在尝试在每个作业 YARN 会话的上下文中使用 JobManager HA 几天前的 1.0.0-rc3 有几个任务管理器的问题 网络接口。

手动杀死job manager进程后,新分配的jobmanager.log 第二职业经理阅读:

2016-03-02 18:01:09,635 WARN  Remoting                                                   
  - Tried to associate with unreachable remote address [akka.tcp://flink@10.127.68.136:34811].
Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters.
Reason: Connection refused: /10.127.68.136:34811
2016-03-02 18:01:09,644 WARN  org.apache.flink.runtime.webmonitor.JobManagerRetriever    
  - Failed to retrieve leader gateway and port.
akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://flink@10.127.68.136:34811/),
Path(/user/jobmanager)]
    at akka.actor.ActorSelection$$anonfun$resolveOne.apply(ActorSelection.scala:65)
    at akka.actor.ActorSelection$$anonfun$resolveOne.apply(ActorSelection.scala:63)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run.processBatch(BatchingExecutor.scala:67)
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run.apply$mcV$sp(BatchingExecutor.scala:82)
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run.apply(BatchingExecutor.scala:59)
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run.apply(BatchingExecutor.scala:59)
    at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
    at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
    at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
    at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
    at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
    at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267)
    at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:508)
    at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:541)
    at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:531)
    at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:87)
    at akka.remote.EndpointWriter.postStop(Endpoint.scala:561)
    at akka.actor.Actor$class.aroundPostStop(Actor.scala:475)
    at akka.remote.EndpointActor.aroundPostStop(Endpoint.scala:415)
    at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
    at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
    at akka.actor.ActorCell.terminate(ActorCell.scala:369)
    at akka.actor.ActorCell.invokeAll(ActorCell.scala:462)
    at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
    at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

未找到的 IP 来自旧作业管理器。到目前为止,这是预期的行为吗?

然后问题出现在新的任务管理器上,它也尝试连接到旧作业 经理不成功。 ZooKeeperLeaderRetrievalService 开始循环遍历可用的 网络接口,可以在相关的 taskmanager.log:

中看到
2016-03-02 18:01:13,636 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
 - Starting ZooKeeperLeaderRetrievalService.
2016-03-02 18:01:13,646 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils         
  - Trying to select the network interface and address to use by connecting to the leading
JobManager.
2016-03-02 18:01:13,646 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils         
  - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
2016-03-02 18:01:13,712 INFO  org.apache.flink.runtime.net.ConnectionUtils               
  - Retrieved new target address /10.127.68.136:34811.
2016-03-02 18:01:14,079 INFO  org.apache.flink.runtime.net.ConnectionUtils               
  - Trying to connect to address /10.127.68.136:34811
2016-03-02 18:01:14,082 INFO  org.apache.flink.runtime.net.ConnectionUtils               
  - Failed to connect from address 'task.manager.eth0.hostname.com/10.127.68.136': Connection
refused
2016-03-02 18:01:14,082 INFO  org.apache.flink.runtime.net.ConnectionUtils               
  - Failed to connect from address '/10.127.68.136': Connection refused
2016-03-02 18:01:14,082 INFO  org.apache.flink.runtime.net.ConnectionUtils               
  - Failed to connect from address '/10.120.193.110': Connection refused
2016-03-02 18:01:14,082 INFO  org.apache.flink.runtime.net.ConnectionUtils               
  - Failed to connect from address '/10.127.68.136': Connection refused
2016-03-02 18:01:14,083 INFO  org.apache.flink.runtime.net.ConnectionUtils               
  - Failed to connect from address '/127.0.0.1': Connection refused
2016-03-02 18:01:14,083 INFO  org.apache.flink.runtime.net.ConnectionUtils               
  - Failed to connect from address '/10.120.193.110': Connection refused
2016-03-02 18:01:14,083 INFO  org.apache.flink.runtime.net.ConnectionUtils               
  - Failed to connect from address '/10.127.68.136': Connection refused
2016-03-02 18:01:14,083 INFO  org.apache.flink.runtime.net.ConnectionUtils               
  - Failed to connect from address '/127.0.0.1': Connection refused

重复五次后,任务管理器停止尝试检索领导者并使用 HEURISTIC 策略最终使用 eth1 (10.120.193.110) 从现在开始:

2016-03-02 18:01:23,650 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
 - Stopping ZooKeeperLeaderRetrievalService.
2016-03-02 18:01:23,655 INFO  org.apache.zookeeper.ClientCnxn                            
  - EventThread shut down
2016-03-02 18:01:23,655 INFO  org.apache.zookeeper.ZooKeeper                             
  - Session: 0x25229757cff035b closed
2016-03-02 18:01:23,664 INFO  org.apache.flink.runtime.taskmanager.TaskManager           
  - TaskManager will use hostname/address 'task.manager.eth1.hostname.com' (10.120.193.110)
for communication.

发现新的 jobmanager 后,taskmanager 就可以在 使用 eth1 的作业管理器。问题是无法连接到 eth1。所以 flink 应该始终使用 eth0。我们后来看到的异常是:

java.io.IOException: Connecting the channel failed: Connecting to remote task manager + 'other.task.manager.eth1.hostname/10.120.193.111:46620'
has failed. This might indicate that the remote task manager has been lost.
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:196)
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access[=13=]0(PartitionRequestClientFactory.java:131)
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:83)
    at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:60)
    at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:115)
    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:388)
    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:411)
    at org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:108)
    at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:175)
    at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:65)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:224)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
    at java.lang.Thread.run(Thread.java:744)

根本原因好像是网络接口选择还在用旧的jobmanager 位置,因此无法选择正确的接口。特别是,似乎 HEURISTIC 和 SLOW 策略在网络接口上的迭代顺序不同, 然后导致选择了错误的接口。

您的问题应该会在即将发布的错误修复版本 1.0.1 中得到解决。或者,您也可以暂时使用当前的 1.1-SNAPSHOT 版本的 Flink。