mesos-master cannot find mesos-slave, and a new leader is elected at short intervals

I set up a Mesos cluster by following this doc.

There are three VMs (Ubuntu 12, CentOS 6.5, CentOS 7.2).

$ cat /etc/hosts
10.142.55.190 zk1
10.142.55.196 zk2
10.142.55.202 zk3

Configuration on each machine:

$ cat /etc/mesos/zk
zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos
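
Since the same URL has to be present on all three VMs, it can help to split it into its endpoints and compare the result machine by machine. The following is a minimal sketch; `zk_endpoints` is a hypothetical helper name, not part of Mesos, and it assumes the URL carries no `user:password@` prefix.

```shell
#!/bin/sh
# Split a Mesos ZooKeeper URL (zk://host:port,host:port/path) into one
# host:port endpoint per line. zk_endpoints is a hypothetical helper.
zk_endpoints() {
    url=$1
    rest=${url#zk://}      # drop the scheme
    hosts=${rest%%/*}      # drop the znode path (for example /mesos)
    printf '%s\n' "$hosts" | tr ',' '\n'
}

zk_endpoints "zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos"
```

Running `zk_endpoints "$(cat /etc/mesos/zk)"` on each VM and comparing the output is one quick way to rule out a mismatched ZooKeeper list.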

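After starting zookeeper, mesos-master and mesos-slave on all three VMs, I can see the Mesos web UI (10.142.55.190:5050), but the agent count is 0.

After a while, the Mesos page shows an error: Failed to connect to 10.142.55.190:5050! Retrying in 16 seconds... (I have now found that ZooKeeper elects a new leader at short intervals.)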

Master info log:

I0919 15:54:59.677438 13281 http.cpp:2022] Redirecting request for /master/state?jsonp=angular.callbacks._1x to the leading master zk3
I0919 15:55:00.098667 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (768)@10.142.55.202:5050
I0919 15:55:00.385279 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (185)@10.142.55.196:5050
I0919 15:55:00.711119 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (771)@10.142.55.202:5050
I0919 15:55:01.347291 13284 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (188)@10.142.55.196:5050
I0919 15:55:01.597682 13284 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (774)@10.142.55.202:5050
I0919 15:55:02.257159 13282 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (191)@10.142.55.196:5050
I0919 15:55:02.370692 13287 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (777)@10.142.55.202:5050
I0919 15:55:03.205920 13285 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (780)@10.142.55.202:5050
I0919 15:55:03.260007 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (194)@10.142.55.196:5050
I0919 15:55:03.929611 13283 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (783)@10.142.55.202:5050
I0919 15:55:04.033308 13287 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (197)@10.142.55.196:5050
I0919 15:55:04.591275 13284 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (200)@10.142.55.196:5050
I0919 15:55:04.608211 13283 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (786)@10.142.55.202:5050
I0919 15:55:05.184682 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (789)@10.142.55.202:5050
I0919 15:55:05.268277 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (203)@10.142.55.196:5050
I0919 15:55:05.775377 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (206)@10.142.55.196:5050
I0919 15:55:05.916445 13285 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (792)@10.142.55.202:5050
I0919 15:55:06.744927 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (209)@10.142.55.196:5050
I0919 15:55:07.378521 13283 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (5)@10.142.55.202:5050
I0919 15:55:07.393311 13285 network.hpp:430] ZooKeeper group memberships changed
I0919 15:55:07.393427 13285 group.cpp:706] Trying to get '/mesos/log_replicas/0000000709' in ZooKeeper
I0919 15:55:07.393985 13285 group.cpp:706] Trying to get '/mesos/log_replicas/0000000711' in ZooKeeper
I0919 15:55:07.394394 13285 group.cpp:706] Trying to get '/mesos/log_replicas/0000000714' in ZooKeeper
I0919 15:55:07.394843 13285 group.cpp:706] Trying to get '/mesos/log_replicas/0000000715' in ZooKeeper
I0919 15:55:07.395418 13285 network.hpp:478] ZooKeeper group PIDs: { log-replica(1)@10.142.55.190:5050, log-replica(1)@10.142.55.196:5050, log-replica(1)@10.142.55.202:5050 }
I0919 15:55:08.178272 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (14)@10.142.55.202:5050
I0919 15:55:09.059562 13282 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (21)@10.142.55.202:5050
I0919 15:55:09.700711 13286 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (24)@10.142.55.202:5050
I0919 15:55:09.742185 13287 http.cpp:381] HTTP GET for /master/state from 10.142.50.94:59987 with User-Agent='Mozilla/5.0 (Windows NT 6.2; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0'
I0919 15:55:09.742359 13287 http.cpp:2022] Redirecting request for /master/state?jsonp=angular.callbacks._1y to the leading master zk3
I0919 15:55:10.660789 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (30)@10.142.55.202:5050
I0919 15:55:11.480326 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (34)@10.142.55.202:5050
I0919 15:55:12.386256 13286 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (37)@10.142.55.202:5050
I0919 15:55:12.975137 13287 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (42)@10.142.55.202:5050
I0919 15:55:13.843091 13285 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (47)@10.142.55.202:5050
I0919 15:55:14.373478 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (51)@10.142.55.202:5050
I0919 15:55:14.937181 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (54)@10.142.55.202:5050
I0919 15:55:15.658219 13283 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (58)@10.142.55.202:5050
I0919 15:55:16.007822 13286 network.hpp:430] ZooKeeper group memberships changed
I0919 15:55:16.007972 13286 group.cpp:706] Trying to get '/mesos/log_replicas/0000000711' in ZooKeeper
I0919 15:55:16.010170 13286 group.cpp:706] Trying to get '/mesos/log_replicas/0000000714' in ZooKeeper
I0919 15:55:16.011462 13284 detector.cpp:152] Detected a new leader: (id='702')
I0919 15:55:16.011556 13284 group.cpp:706] Trying to get '/mesos/json.info_0000000702' in ZooKeeper
I0919 15:55:16.011968 13286 group.cpp:706] Trying to get '/mesos/log_replicas/0000000715' in ZooKeeper
I0919 15:55:16.012526 13286 network.hpp:478] ZooKeeper group PIDs: { log-replica(1)@10.142.55.190:5050, log-replica(1)@10.142.55.196:5050, log-replica(1)@10.142.55.202:5050 }
I0919 15:55:16.013156 13284 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.190:5050) is detected
I0919 15:55:16.013222 13284 master.cpp:1847] The newly elected leader is master@10.142.55.190:5050 with id 677967bc-f6f0-46b3-a44e-72eed1befd60
I0919 15:55:16.013244 13284 master.cpp:1860] Elected as the leading master!
I0919 15:55:16.013273 13284 master.cpp:1547] Recovering from registrar
I0919 15:55:16.013352 13284 registrar.cpp:332] Recovering registrar
I0919 15:55:16.014081 13280 log.cpp:553] Attempting to start the writer
I0919 15:55:16.014515 13280 replica.cpp:493] Replica received implicit promise request from (211)@10.142.55.190:5050 with proposal 1204590
I0919 15:55:16.018023 13282 consensus.cpp:360] Aborting implicit promise request because 2 ignores received
I0919 15:55:16.018028 13280 leveldb.cpp:304] Persisting metadata (10 bytes) to leveldb took 3.469479ms
I0919 15:55:16.018338 13280 replica.cpp:342] Persisted promised to 1204590
I0919 15:55:16.018508 13282 log.cpp:565] Could not start the writer, but can be retried
I0919 15:55:16.018645 13282 log.cpp:553] Attempting to start the writer
I0919 15:55:16.018899 13282 replica.cpp:493] Replica received implicit promise request from (215)@10.142.55.190:5050 with proposal 1204591
I0919 15:55:16.022183 13287 consensus.cpp:360] Aborting implicit promise request because 2 ignores received
I0919 15:55:16.022367 13280 log.cpp:565] Could not start the writer, but can be retried
I0919 15:55:16.022510 13280 log.cpp:553] Attempting to start the writer
I0919 15:55:16.028880 13282 leveldb.cpp:304] Persisting metadata (10 bytes) to leveldb took 9.870818ms
I0919 15:55:16.029024 13282 replica.cpp:342] Persisted promised to 1204591
I0919 15:55:16.029428 13286 replica.cpp:493] Replica received implicit promise request from (219)@10.142.55.190:5050 with proposal 1204592
I0919 15:55:16.031600 13280 consensus.cpp:360] Aborting implicit promise request because 2 ignores received
I0919 15:55:16.036208 13283 log.cpp:565] Could not start the writer, but can be retried
I0919 15:55:16.036454 13283 log.cpp:553] Attempting to start the writer
I0919 15:55:16.040256 13286 leveldb.cpp:304] Persisting metadata (10 bytes) to leveldb took 10.783237ms
I0919 15:55:16.040339 13286 replica.cpp:342] Persisted promised to 1204592
I0919 15:55:16.040712 13286 replica.cpp:493] Replica received implicit promise request from (222)@10.142.55.190:5050 with proposal 1204593
I0919 15:55:16.042196 13286 leveldb.cpp:304] Persisting metadata (10 bytes) to leveldb took 1.435071ms
I0919 15:55:16.042250 13286 replica.cpp:342] Persisted promised to 1204593
I0919 15:55:16.042981 13286 consensus.cpp:360] Aborting implicit promise request because 2 ignores received
I0919 15:55:16.043099 13286 log.cpp:565] Could not start the writer, but can be retried
I0919 15:55:16.043303 13283 log.cpp:553] Attempting to start the writer

All subsequent log output repeats in a loop:

I0919 15:55:16.043676 13286 replica.cpp:493] Replica received implicit promise request from (225)@10.142.55.190:5050 with proposal 1204594
I0919 15:55:16.044122 13286 leveldb.cpp:304] Persisting metadata (10 bytes) to leveldb took 404769ns
I0919 15:55:16.044209 13286 replica.cpp:342] Persisted promised to 1204594
I0919 15:55:16.044837 13281 consensus.cpp:360] Aborting implicit promise request because 2 ignores received
I0919 15:55:16.044926 13281 log.cpp:565] Could not start the writer, but can be retried
I0919 15:55:16.045038 13281 log.cpp:553] Attempting to start the writer

Slave info log:

Log file created at: 2016/09/19 15:41:16
Running on machine: ubuntu12
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0919 15:41:16.346844 12986 logging.cpp:194] INFO level logging started!
I0919 15:41:16.363313 12986 containerizer.cpp:196] Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni
I0919 15:41:16.370334 12986 main.cpp:434] Starting Mesos agent
I0919 15:41:16.371184 12986 slave.cpp:198] Agent started on 1)@127.0.1.1:5051
I0919 15:41:16.371636 12986 slave.cpp:199] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --image_provisioner_backend="copy" --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem" --launcher="posix" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_enable_support="true" 
--systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos"
I0919 15:41:16.373072 12986 slave.cpp:519] Agent resources: cpus(*):2; mem(*):2930; disk(*):4469; ports(*):[31000-32000]
I0919 15:41:16.373291 12986 slave.cpp:527] Agent attributes: [  ]
I0919 15:41:16.373347 12986 slave.cpp:532] Agent hostname: ubuntu12
I0919 15:41:16.379895 13005 state.cpp:57] Recovering state from '/var/lib/mesos/meta'
I0919 15:41:16.382519 13005 group.cpp:349] Group process (group(1)@127.0.1.1:5051) connected to ZooKeeper
I0919 15:41:16.382593 13005 group.cpp:837] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0919 15:41:16.382663 13005 group.cpp:427] Trying to create path '/mesos' in ZooKeeper
I0919 15:41:16.382910 13009 status_update_manager.cpp:200] Recovering status update manager
I0919 15:41:16.383419 13009 containerizer.cpp:522] Recovering containerizer
I0919 15:41:16.392206 13004 provisioner.cpp:253] Provisioner recovery complete
I0919 15:41:16.392354 13004 slave.cpp:4782] Finished recovery
I0919 15:41:16.405709 13004 detector.cpp:152] Detected a new leader: (id='678')
I0919 15:41:16.406067 13005 group.cpp:706] Trying to get '/mesos/json.info_0000000678' in ZooKeeper
I0919 15:41:16.407572 13002 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.190:5050) is detected
I0919 15:41:16.407977 13002 slave.cpp:895] New master detected at master@10.142.55.190:5050
I0919 15:41:16.408043 13002 slave.cpp:916] No credentials provided. Attempting to register without authentication
I0919 15:41:16.408140 13002 slave.cpp:927] Detecting new master
I0919 15:41:16.408223 13005 status_update_manager.cpp:174] Pausing sending status updates
I0919 15:42:08.418956 13006 slave.cpp:3732] master@10.142.55.190:5050 exited
W0919 15:42:08.419035 13006 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
I0919 15:42:16.374977 13007 slave.cpp:4591] Current disk usage 72.41%. Max allowed age: 1.231186482451933days
I0919 15:42:20.007169 13007 detector.cpp:152] Detected a new leader: (id='679')
I0919 15:42:20.007297 13007 group.cpp:706] Trying to get '/mesos/json.info_0000000679' in ZooKeeper
I0919 15:42:20.008503 13007 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.196:5050) is detected
I0919 15:42:20.008587 13007 slave.cpp:895] New master detected at master@10.142.55.196:5050
I0919 15:42:20.008610 13007 slave.cpp:916] No credentials provided. Attempting to register without authentication
I0919 15:42:20.008703 13007 slave.cpp:927] Detecting new master
I0919 15:42:20.008750 13007 status_update_manager.cpp:174] Pausing sending status updates
I0919 15:43:16.387984 13005 slave.cpp:4591] Current disk usage 72.41%. Max allowed age: 1.231162010606794days
I0919 15:43:20.081272 13005 slave.cpp:3732] master@10.142.55.196:5050 exited
W0919 15:43:20.081374 13005 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
I0919 15:43:26.855154 13005 slave.cpp:3732] master@10.142.55.196:5050 exited
W0919 15:43:26.855315 13005 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0919 15:43:26.855159 13010 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
I0919 15:43:32.020196 13002 detector.cpp:152] Detected a new leader: (id='682')
I0919 15:43:32.020300 13002 group.cpp:706] Trying to get '/mesos/json.info_0000000682' in ZooKeeper
I0919 15:43:32.022203 13002 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.202:5050) is detected
I0919 15:43:32.022302 13002 slave.cpp:895] New master detected at master@10.142.55.202:5050
I0919 15:43:32.022328 13002 slave.cpp:916] No credentials provided. Attempting to register without authentication
I0919 15:43:32.022382 13002 slave.cpp:927] Detecting new master
I0919 15:43:32.022423 13002 status_update_manager.cpp:174] Pausing sending status updates
I0919 15:44:16.389369 13003 slave.cpp:4591] Current disk usage 72.41%. Max allowed age: 1.231119184877789days
I0919 15:44:32.535347 13003 slave.cpp:3732] master@10.142.55.202:5050 exited
W0919 15:44:32.535522 13003 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
I0919 15:44:42.005375 13002 detector.cpp:152] Detected a new leader: (id='684')
I0919 15:44:42.005496 13002 group.cpp:706] Trying to get '/mesos/json.info_0000000684' in ZooKeeper
I0919 15:44:42.006367 13002 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.190:5050) is detected
I0919 15:44:42.006492 13002 slave.cpp:895] New master detected at master@10.142.55.190:5050
I0919 15:44:42.006597 13002 slave.cpp:916] No credentials provided. Attempting to register without authentication
I0919 15:44:42.006675 13002 slave.cpp:927] Detecting new master
I0919 15:44:42.006577 13008 status_update_manager.cpp:174] Pausing sending status updates
I0919 15:45:16.400794 13006 slave.cpp:4591] Current disk usage 72.48%. Max allowed age: 1.226390000804074days
I0919 15:45:42.354790 13005 slave.cpp:3732] master@10.142.55.190:5050 exited
W0919 15:45:42.354857 13005 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
I0919 15:45:54.020563 13002 detector.cpp:152] Detected a new leader: (id='687')
I0919 15:45:54.020756 13002 group.cpp:706] Trying to get '/mesos/json.info_0000000687' in ZooKeeper
I0919 15:45:54.023296 13002 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.196:5050) is detected
I0919 15:45:54.023455 13002 slave.cpp:895] New master detected at master@10.142.55.196:5050
I0919 15:45:54.023558 13002 slave.cpp:916] No credentials provided. Attempting to register without authentication
I0919 15:45:54.023526 13008 status_update_manager.cpp:174] Pausing sending status updates
I0919 15:45:54.023669 13002 slave.cpp:927] Detecting new master
I0919 15:46:16.402402 13003 slave.cpp:4591] Current disk usage 72.53%. Max allowed age: 1.223205601954942days
I0919 15:46:54.075505 13007 slave.cpp:3732] master@10.142.55.196:5050 exited
W0919 15:46:54.075592 13007 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0919 15:46:56.098012 13010 process.cpp:2105] Failed to shutdown socket with fd 14: Transport endpoint is not connected
I0919 15:46:56.098016 13007 slave.cpp:3732] master@10.142.55.196:5050 exited
W0919 15:46:56.098253 13007 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0919 15:46:56.462254 13010 process.cpp:2105] Failed to shutdown socket with fd 14: Transport endpoint is not connected
I0919 15:46:56.462260 13005 slave.cpp:3732] master@10.142.55.196:5050 exited
W0919 15:46:56.462540 13005 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
I0919 15:47:02.005637 13009 detector.cpp:152] Detected a new leader: (id='688')
I0919 15:47:02.005765 13009 group.cpp:706] Trying to get '/mesos/json.info_0000000688' in ZooKeeper
I0919 15:47:02.006853 13009 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.202:5050) is detected
I0919 15:47:02.006959 13009 slave.cpp:895] New master detected at master@10.142.55.202:5050
I0919 15:47:02.006986 13009 slave.cpp:916] No credentials provided. Attempting to register without authentication
I0919 15:47:02.007025 13009 slave.cpp:927] Detecting new master
I0919 15:47:02.007061 13009 status_update_manager.cpp:174] Pausing sending status updates
I0919 15:47:16.406669 13008 slave.cpp:4591] Current disk usage 72.53%. Max allowed age: 1.223184189090440days
I0919 15:48:02.950891 13005 slave.cpp:3732] master@10.142.55.202:5050 exited
W0919 15:48:02.950994 13005 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
I0919 15:48:12.006634 13005 detector.cpp:152] Detected a new leader: (id='690')
I0919 15:48:12.006817 13003 group.cpp:706] Trying to get '/mesos/json.info_0000000690' in ZooKeeper
I0919 15:48:12.007987 13003 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.190:5050) is detected
I0919 15:48:12.008126 13003 slave.cpp:895] New master detected at master@10.142.55.190:5050
I0919 15:48:12.008210 13003 slave.cpp:916] No credentials provided. Attempting to register without authentication
I0919 15:48:12.008280 13003 slave.cpp:927] Detecting new master
I0919 15:48:12.008191 13008 status_update_manager.cpp:174] Pausing sending status updates
I0919 15:48:16.409266 13003 slave.cpp:4591] Current disk usage 72.54%. Max allowed age: 1.222480623542604days
I0919 15:49:12.379010 13009 slave.cpp:3732] master@10.142.55.190:5050 exited
W0919 15:49:12.379149 13009 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0919 15:49:12.379233 13010 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
I0919 15:49:16.413767 13007 slave.cpp:4591] Current disk usage 72.64%. Max allowed age: 1.215032005677465days
I0919 15:49:24.016290 13007 detector.cpp:152] Detected a new leader: (id='693')
I0919 15:49:24.016417 13007 group.cpp:706] Trying to get '/mesos/json.info_0000000693' in ZooKeeper
I0919 15:49:24.018273 13007 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.196:5050) is detected
I0919 15:49:24.018437 13007 slave.cpp:895] New master detected at master@10.142.55.196:5050
I0919 15:49:24.018523 13007 slave.cpp:916] No credentials provided. Attempting to register without authentication
I0919 15:49:24.018604 13007 slave.cpp:927] Detecting new master
I0919 15:49:24.018496 13008 status_update_manager.cpp:174] Pausing sending status updates
I0919 15:50:16.416391 13008 slave.cpp:4591] Current disk usage 72.64%. Max allowed age: 1.215016710774248days
I0919 15:50:24.065268 13003 slave.cpp:3732] master@10.142.55.196:5050 exited
W0919 15:50:24.065342 13003 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
I0919 15:50:24.485752 13004 slave.cpp:3732] master@10.142.55.196:5050 exited
W0919 15:50:24.485839 13004 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0919 15:50:24.485977 13010 process.cpp:2105] Failed to shutdown socket with fd 14: Transport endpoint is not connected
I0919 15:50:28.343647 13003 slave.cpp:3732] master@10.142.55.196:5050 exited
W0919 15:50:28.343719 13003 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0919 15:50:28.343819 13010 process.cpp:2105] Failed to shutdown socket with fd 14: Transport endpoint is not connected
I0919 15:50:31.545099 13005 slave.cpp:3732] master@10.142.55.196:5050 exited
W0919 15:50:31.545171 13005 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0919 15:50:31.545284 13010 process.cpp:2105] Failed to shutdown socket with fd 14: Transport endpoint is not connected
I0919 15:50:32.007096 13008 detector.cpp:152] Detected a new leader: (id='694')
I0919 15:50:32.007195 13008 group.cpp:706] Trying to get '/mesos/json.info_0000000694' in ZooKeeper
I0919 15:50:32.009881 13008 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.202:5050) is detected
I0919 15:50:32.009970 13008 slave.cpp:895] New master detected at master@10.142.55.202:5050
I0919 15:50:32.009994 13008 slave.cpp:916] No credentials provided. Attempting to register without authentication
I0919 15:50:32.010030 13008 slave.cpp:927] Detecting new master
I0919 15:50:32.010079 13008 status_update_manager.cpp:174] Pausing sending status updates
I0919 15:51:16.417846 13006 slave.cpp:4591] Current disk usage 72.64%. Max allowed age: 1.214964708103322days
I0919 15:51:32.560317 13003 slave.cpp:3732] master@10.142.55.202:5050 exited
W0919 15:51:32.560410 13003 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
I0919 15:51:42.005147 13009 detector.cpp:152] Detected a new leader: (id='696')
I0919 15:51:42.005265 13009 group.cpp:706] Trying to get '/mesos/json.info_0000000696' in ZooKeeper
I0919 15:51:42.006824 13009 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.190:5050) is detected
I0919 15:51:42.006904 13009 slave.cpp:895] New master detected at master@10.142.55.190:5050
I0919 15:51:42.006928 13009 slave.cpp:916] No credentials provided. Attempting to register without authentication
I0919 15:51:42.006963 13009 slave.cpp:927] Detecting new master
I0919 15:51:42.006999 13009 status_update_manager.cpp:174] Pausing sending status updates
I0919 15:52:16.419373 13003 slave.cpp:4591] Current disk usage 72.71%. Max allowed age: 1.209981628636250days
I0919 15:52:42.336305 13002 slave.cpp:3732] master@10.142.55.190:5050 exited
W0919 15:52:42.336426 13002 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
I0919 15:52:54.005267 13005 detector.cpp:152] Detected a new leader: (id='699')
I0919 15:52:54.005408 13005 group.cpp:706] Trying to get '/mesos/json.info_0000000699' in ZooKeeper
I0919 15:52:54.006206 13005 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.196:5050) is detected
I0919 15:52:54.006285 13005 slave.cpp:895] New master detected at master@10.142.55.196:5050
I0919 15:52:54.006309 13005 slave.cpp:916] No credentials provided. Attempting to register without authentication
I0919 15:52:54.006398 13005 slave.cpp:927] Detecting new master
I0919 15:52:54.006451 13005 status_update_manager.cpp:174] Pausing sending status updates
I0919 15:53:16.420258 13005 slave.cpp:4591] Current disk usage 72.76%. Max allowed age: 1.206748286096840days
I0919 15:53:54.071012 13005 slave.cpp:3732] master@10.142.55.196:5050 exited
W0919 15:53:54.071143 13005 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
I0919 15:54:01.105780 13002 slave.cpp:3732] master@10.142.55.196:5050 exited
W0919 15:54:01.105854 13002 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0919 15:54:01.105970 13010 process.cpp:2105] Failed to shutdown socket with fd 15: Transport endpoint is not connected
I0919 15:54:05.733837 13007 slave.cpp:3732] master@10.142.55.196:5050 exited
W0919 15:54:05.733932 13007 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0919 15:54:05.734071 13010 process.cpp:2105] Failed to shutdown socket with fd 15: Transport endpoint is not connected
E0919 15:54:05.818560 13010 process.cpp:2105] Failed to shutdown socket with fd 15: Transport endpoint is not connected
I0919 15:54:05.818583 13003 slave.cpp:3732] master@10.142.55.196:5050 exited
W0919 15:54:05.818758 13003 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
I0919 15:54:06.004385 13009 detector.cpp:152] Detected a new leader: (id='700')
I0919 15:54:06.004494 13009 group.cpp:706] Trying to get '/mesos/json.info_0000000700' in ZooKeeper
I0919 15:54:06.005511 13009 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.202:5050) is detected
I0919 15:54:06.005586 13009 slave.cpp:895] New master detected at master@10.142.55.202:5050
I0919 15:54:06.005609 13009 slave.cpp:916] No credentials provided. Attempting to register without authentication
I0919 15:54:06.005676 13009 slave.cpp:927] Detecting new master
I0919 15:54:06.005720 13009 status_update_manager.cpp:174] Pausing sending status updates
I0919 15:54:16.423193 13002 slave.cpp:4591] Current disk usage 72.76%. Max allowed age: 1.206699342406551days
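
The agent log above shows the leadership rotating between 10.142.55.190, 10.142.55.196 and 10.142.55.202, with a new leader detected every 64 to 72 seconds. A small filter makes that pattern easier to see than scrolling through the raw log. This is only a sketch: `leader_changes` is a hypothetical helper name, and it relies on the exact wording of the `zookeeper.cpp` line shown in this log, which may differ in other Mesos versions.

```shell
#!/bin/sh
# Print the time and address of every leader change found in a Mesos
# agent log read from stdin. leader_changes is a hypothetical helper.
leader_changes() {
    # Matching lines look like:
    # I0919 15:41:16.407572 13002 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.190:5050) is detected
    sed -n 's/^[IWEF][0-9]* \([0-9:.]*\) .*A new leading master (UPID=\(.*\)) is detected$/\1 \2/p'
}
```

Usage would be something like `leader_changes < /var/log/mesos/mesos-slave.INFO`. The directory comes from the `--log_dir` flag in the startup line above; the file name is an assumption based on the usual glog naming.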

Thanks to Joseph Wu for helping me solve the problem. Details:

Two recurring log messages tell you (indirectly) that something is wrong:

I0919 15:55:08.178272 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (14)@10.142.55.202:5050

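This message indicates that you have started this master before, with the same work directory. It has some persistent state in its work directory.

This log message tells you that there are two masters which you have not started before: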

I0919 15:55:16.018023 13282 consensus.cpp:360] Aborting implicit promise request because 2 ignores received

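The master will refuse to start because fewer than a quorum of masters have persistent state. If the master did start, you could lose data. This is expected behavior, since Mesos errs on the side of caution.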
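
For reference, the quorum here is a strict majority of the masters. With three masters that is two, so a master whose promise request is ignored by the two other replicas cannot reach it on its own. A minimal sketch of the arithmetic follows; `quorum_size` is a hypothetical helper name.

```shell
#!/bin/sh
# Majority quorum for a given number of masters: floor(n / 2) + 1.
# quorum_size is a hypothetical helper, not a Mesos command.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

quorum_size 3   # prints 2
```

The value given to the master's `--quorum` flag would normally be this number. I did not show my own quorum setting above, so treat it as something to check rather than a confirmed cause.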


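If I want a fresh Mesos cluster, the masters need a clean work directory. However, the problem was not on 10.142.55.202, as Joseph Wu had suggested. I cleared the work_dir on all machines, and that got rid of the problem.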

How to clean the work directory:

  1. Find the mesos-master work directory

    $ cat /etc/mesos-master/work_dir
    /var/lib/mesos
    
  2. Delete it

    $ rm -rf /var/lib/mesos
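
A more cautious variant of step 2 is to move the directory aside rather than delete it, so the old state can still be inspected or restored. This is a sketch only; `archive_work_dir` is a hypothetical helper name, and it assumes mesos-master (and mesos-slave, if it shares the directory) has been stopped first.

```shell
#!/bin/sh
# Move a work directory aside instead of deleting it, so the old
# replicated-log state is kept. Prints the backup path on success.
# archive_work_dir is a hypothetical helper, not a Mesos command.
archive_work_dir() {
    dir=$1
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    backup="$dir.bak.$(date +%Y%m%d%H%M%S)"   # timestamped sibling path
    mv "$dir" "$backup" && echo "$backup"
}
```

It would be called as `archive_work_dir "$(cat /etc/mesos-master/work_dir)"` on each master. Note that in the agent flags above, `--work_dir` is also `/var/lib/mesos`, so on these machines clearing or moving that directory discards the agent's state as well.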