Flink HA standalone cluster fails to start
Two machines, 203 and 204, each running a jobmanager and a taskmanager.
masters
hz203:9081
hz204:9081
slaves
hz203
hz204
flink-conf.yaml
jobmanager.rpc.port: 6123
rest.port: 9081
blob.server.port: 6124
query.server.port: 6125
web.tmpdir: /home/ctu/flink/deploy/webTmp
web.log.path: /home/ctu/flink/deploy/log
taskmanager.tmp.dirs: /home/ctu/flink/deploy/taskManagerTmp
high-availability: zookeeper
high-availability.storageDir: file:///home/ctu/flink/deploy/HA
high-availability.zookeeper.quorum: 10.0.1.79:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /flink
Running ./start-cluster.sh
Starting HA cluster with 2 masters.
Starting standalonesession daemon on host hz203.
Starting standalonesession daemon on host hz204.
Starting taskexecutor daemon on host hz203.
Starting taskexecutor daemon on host hz204.
Log
2018-12-20 20:44:03,843 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
2018-12-20 20:44:03,864 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Web frontend listening at http://127.0.0.1:9081.
2018-12-20 20:44:03,875 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
2018-12-20 20:44:03,989 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
2018-12-20 20:44:03,999 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2018-12-20 20:44:04,008 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-12-20 20:44:04,009 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
2018-12-20 20:44:04,010 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2018-12-20 20:44:04,206 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /127.0.0.1:43012
2018-12-20 20:44:04,221 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@127.0.0.1:43012] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@127.0.0.1:43012]] Caused by: [Connection refused: /127.0.0.1:43012]
2018-12-20 20:44:04,301 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /127.0.0.1:43012
2018-12-20 20:44:04,301 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@127.0.0.1:43012] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@127.0.0.1:43012]] Caused by: [Connection refused: /127.0.0.1:43012]
2018-12-20 20:44:04,378 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /127.0.0.1:43012
2018-12-20 20:44:04,378 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@127.0.0.1:43012] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@127.0.0.1:43012]] Caused by: [Connection refused: /127.0.0.1:43012]
2018-12-20 20:44:04,451 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /127.0.0.1:43012
2018-12-20 20:44:04,451 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@127.0.0.1:43012] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@127.0.0.1:43012]] Caused by: [Connection refused: /127.0.0.1:43012]
2018-12-20 20:44:04,520 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /127.0.0.1:43012
Question
`akka.tcp://flink@127.0.0.1:33567/user/resourcemanager`: why is the address 127.0.0.1 instead of the `jobmanager` IP from the `masters` config file?
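The problem is a bug that we fixed in version 1.6.1. In 1.6.0, we did not respect the --host command-line option in the method ClusterEntrypoint#loadConfiguration, as you can see by comparing the 1.6.0 code with the code of version 1.6.1.
Therefore, upgrading to the latest 1.6.x version should fix the problem. In general, I always recommend upgrading to the latest bug-fix release if possible.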
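To check whether a jobmanager is still advertising a loopback address after an upgrade, it can help to scan its log for `akka.tcp://` addresses that point at 127.0.0.1. The following is only a minimal sketch: the function name `loopback_akka_addresses` and the sample list are made up for illustration, and the regular expression assumes the address format seen in the log excerpt above.

```python
import re

# Matches akka.tcp://<system>@<host>:<port> as it appears in the log above.
AKKA_ADDR = re.compile(
    r"akka\.tcp://(?P<system>[\w-]+)@(?P<host>[^:/\]\s]+):(?P<port>\d+)"
)
LOOPBACK_HOSTS = {"127.0.0.1", "localhost"}


def loopback_akka_addresses(log_lines):
    """Return the sorted, de-duplicated (host, port) pairs that use a loopback host.

    Hypothetical helper for illustration; not part of Flink.
    """
    found = set()
    for line in log_lines:
        for match in AKKA_ADDR.finditer(line):
            if match.group("host") in LOOPBACK_HOSTS:
                found.add((match.group("host"), int(match.group("port"))))
    return sorted(found)


# Two lines shortened from the log excerpt in the question.
sample = [
    "2018-12-20 20:44:04,221 WARN akka.remote.ReliableDeliverySupervisor - "
    "Association with remote system [akka.tcp://flink@127.0.0.1:43012] has failed",
    "2018-12-20 20:44:03,875 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - "
    "Starting RPC endpoint at akka://flink/user/resourcemanager .",
]

print(loopback_akka_addresses(sample))  # [('127.0.0.1', 43012)]
```

An empty result on a multi-host cluster would suggest that the masters are advertising their real host names rather than the loopback interface.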