Cannot Start Mesos/Marathon Cluster

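Physical machine: 192.168.10.1 (Mesos, Zookeeper, Marathon)
Virtual machine: 192.168.122.10 (Mesos, Zookeeper)
Virtual machine: 192.168.122.46 (Mesos, Zookeeper)

OS: all three machines run Fedora 23 Server.

By default the two networks are already routed to each other, since the virtual machines reside on the physical machine.

No firewall is set up.

Mesos election log: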

Master bound to loopback interface! Cannot communicate with remote schedulers or slaves. You might want to set '--ip' flag to a routable IP address.

I can set it manually, but I can't set it dynamically... the --ip_discovery_command flag is not recognized.

What I want to do is link the following script to that flag.

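# Print the IPv4 address of enp8s0 if that interface exists, otherwise of eth0.
# With '/' or a single space as the separator, the address is field 6 of the "inet" line.
if [[ $(ip addr) == *enp8s0* ]]; then
    ip addr show enp8s0 | awk -F'/| ' '/inet / { print $6 }'
else
    ip addr show eth0 | awk -F'/| ' '/inet / { print $6 }'
fi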

When I set it manually (which is not what I want to do)...

The Mesos page at IP:5050 comes up... but mesos-master then fails after 1 minute because of this...

F0427 17:03:27.975260  6914 master.cpp:1253] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
    @     0x7f8360fa9edd  (unknown)
    @     0x7f8360fabc50  (unknown)
    @     0x7f8360fa9ad3  (unknown)
    @     0x7f8360fac61e  (unknown)
    @     0x7f83619a85dd  (unknown)
    @     0x7f83619e7c30  (unknown)
    @     0x55a885ee3b2e  (unknown)
    @     0x7f8361a11c0e  (unknown)
    @     0x7f8361a5d75e  (unknown)
    @     0x7f8361a7077a  (unknown)
    @     0x7f83618f4aae  (unknown)
    @     0x7f8361a70768  (unknown)
    @     0x7f8361a548d0  (unknown)
    @     0x7f8361fc832c  (unknown)
    @     0x7f8361fd42a5  (unknown)
    @     0x7f8361fd472f  (unknown)
    @     0x7f8360a5e60a  start_thread
    @     0x7f835fefda4d  __clone
Aborted (core dumped)

The Zookeeper configuration is as follows:

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/var/lib/zookeeper/data
dataLogDir=/var/lib/zookeeper/log
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
server.1:192.168.10.1:2888:3888
server.2:192.168.122.46:2888:3888
server.3:192.168.122.10:2888:3888

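I don't know how to verify whether it is working properly...

Honestly, I'm exhausted by the poor documentation and the lack of a proper architectural explanation (mostly Marathon), the badly organized logs (Mesos), systemd's inability to run bash and use its output as a variable, and the general lack of instructions.

Am I doing something wrong? I'd appreciate any help I can get. If you need anything I haven't provided yet, let me know and I'll post it right away.

Edit:

I solved the Marathon problem by adding two additional Marathon servers on the VMs so that they can form a quorum.

Edit 2:

I now have a problem where the Mesos servers keep re-electing the leader in rapid succession... but depending on the results I'll investigate that later...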

If you follow the installation docs closely, I think you should be able to get it working.

For example, your "Master bound to loopback" problem is, IMHO, related to an incorrect or incomplete setup. See:

Hostname (optional)

If you're unable to resolve the hostname of the machine directly (e.g., if on a different network or using a VPN), set /etc/mesos-master/hostname to a value that you can resolve, for example, an externally accessible IP address or DNS hostname. This will ensure all links from the Mesos console work correctly.

You will also want to set this property in /etc/marathon/conf/hostname.

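In addition, I'd recommend setting the master's IP address in the /etc/mesos-master/ip file. Always make sure that the hostnames resolve to a non-local IP address, e.g. by adding entries to the /etc/hosts file on each host.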
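One way to fill /etc/mesos-master/ip from a script rather than by hand is sketched below. It assumes the one-line output format of `ip -o -4 addr show`; the helper names `first_ipv4` and `write_master_ip` are made up for this example, and whether that file is actually read at startup depends on how Mesos was packaged on your system.

```shell
# Print the first IPv4 address found in `ip -o -4 addr show` output read from stdin.
first_ipv4() {
    awk '$3 == "inet" { split($4, a, "/"); print a[1]; exit }'
}

# Write an address to <dir>/ip. The directory defaults to /etc/mesos-master;
# loopback and malformed addresses are rejected.
write_master_ip() {
    local ip="$1" dir="${2:-/etc/mesos-master}"
    if [[ ! $ip =~ ^[0-9]+(\.[0-9]+){3}$ ]]; then
        echo "not an IPv4 address: $ip" >&2
        return 1
    fi
    if [[ $ip == 127.* ]]; then
        echo "refusing loopback address: $ip" >&2
        return 1
    fi
    printf '%s\n' "$ip" > "$dir/ip"
}
```

Run as root before starting mesos-master, it could be used like this: write_master_ip "$(ip -o -4 addr show dev enp8s0 | first_ipv4)"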

The /etc/hosts file on each host should look similar to this (replace the host names with the actual ones):

127.0.0.1 localhost

192.168.10.1 host1
192.168.122.10 host2
192.168.122.46 host3
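To confirm that a name really maps to a routable address, a small check along these lines may help. It is only a sketch that reads /etc/hosts-style text from stdin and ignores DNS; `lookup_host` and `resolves_to_routable` are hypothetical helper names.

```shell
# Look up a host name in /etc/hosts-style text on stdin and print its first address.
lookup_host() {
    awk -v name="$1" '
        /^[[:space:]]*#/ { next }   # skip comment lines
        { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }
    '
}

# Succeed only when the name is present and maps to a non-loopback address.
resolves_to_routable() {
    local addr
    addr=$(lookup_host "$1")
    [[ -n $addr && $addr != 127.* && $addr != ::1 ]]
}
```

For example, resolves_to_routable "$(hostname)" < /etc/hosts should succeed on every node. If it fails, the master is likely to bind to the loopback interface again.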

If you just want to test a Mesos cluster, you could also use a preconfigured Vagrant solution such as tobilg/coreos-mesos-cluster.

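Regarding the ZooKeeper setup, make sure you created a myid file on each node, e.g. /var/lib/zookeeper/myid (ZooKeeper looks for it in its dataDir, which is /var/lib/zookeeper/data in the configuration above), containing the numeric ID you assigned to that node. For example, on 192.168.10.1 the only content of the file needs to be 1.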
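The ID can also be derived from the server.N lines instead of being typed by hand. The following is a rough sketch; `myid_for` is a hypothetical helper, and it accepts both the `server.N=host:...` form and the `server.N:host:...` form used in the question.

```shell
# Print the N of the "server.N" line whose host equals $1.
# Reads zoo.cfg-style text from stdin; prints nothing when no line matches.
myid_for() {
    awk -v ip="$1" '
        /^server\.[0-9]+[=:]/ {
            id = $0
            sub(/^server\./, "", id)              # drop the "server." prefix
            sub(/[=:].*$/, "", id)                # keep only the number
            host = $0
            sub(/^server\.[0-9]+[=:]/, "", host)  # drop "server.N=" or "server.N:"
            sub(/:.*$/, "", host)                 # drop the two port numbers
            if (host == ip) { print id; exit }
        }
    '
}
```

On 192.168.122.46, for instance, the output of myid_for 192.168.122.46 could be redirected into the myid file, with the ZooKeeper configuration file as input (its location varies between packages, so no path is assumed here).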

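Before debugging the masters, check that the ZooKeeper cluster is working properly and has elected a leader. Make sure /etc/mesos/zk contains the correct ZooKeeper connection string on each host, e.g.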

zk://192.168.10.1:2181,192.168.122.10:2181,192.168.122.46:2181/mesos
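To check the ZooKeeper side first, something like the following might help. It is a sketch under assumptions: `nc` must be installed, the four-letter-word command `srvr` must be enabled on your ZooKeeper version, and the function names are invented for illustration.

```shell
# Extract the "Mode:" value (leader, follower or standalone) from `srvr` output on stdin.
parse_zk_mode() {
    awk -F': ' '/^Mode:/ { print $2 }'
}

# Ask one ZooKeeper server for its mode; the port defaults to clientPort 2181.
zk_mode() {
    echo srvr | nc -w 2 "$1" "${2:-2181}" | parse_zk_mode
}

# Print the mode of every server given as an argument.
zk_report() {
    local host
    for host in "$@"; do
        printf '%s: %s\n' "$host" "$(zk_mode "$host")"
    done
}
```

Calling zk_report 192.168.10.1 192.168.122.10 192.168.122.46 on a healthy three-node ensemble should list one leader and two followers. An empty mode means that server did not answer.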

If ZK works, restart the services and check the masters' logs. Do the same for the slaves.
