Ceph Monitor out of quorum

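We are having a problem with one of our Ceph monitors. The cluster uses 3 monitors, and all of them are up and running. They can communicate with each other and each returns relevant ceph -s output. However, the quorum shows the second monitor as down. The ceph -s output from the monitor that is supposedly down is below: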

cluster:
    id:     bb1ab46a-d282-4530-bf5c-021e9c940958
    health: HEALTH_WARN
            insufficient standby MDS daemons available
            noout flag(s) set
            9 large omap objects
            47 pgs not deep-scrubbed in time
            application not enabled on 2 pool(s)
            1/3 mons down, quorum mon1,mon3

  services:
    mon:        3 daemons, quorum mon1,mon3 (age 3d), out of quorum: mon2
    mgr:        mon1(active, since 3d)
    mds:        filesystem:1 {0=mon1=up:active}
    osd:        77 osds: 77 up (since 3d), 77 in (since 2w)
                flags noout
    rbd-mirror: 1 daemon active (12512649)
    rgw:        1 daemon active (mon1)

  data:
    pools:   13 pools, 1500 pgs
    objects: 65.36M objects, 23 TiB
    usage:   85 TiB used, 701 TiB / 785 TiB avail
    pgs:     1500 active+clean

  io:
    client:   806 KiB/s wr, 0 op/s rd, 52 op/s wr
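
To pull just the affected monitor names out of status text like the above, a minimal sketch follows. It assumes the `out of quorum:` wording shown in the `mon:` line of this output; the function name `out_of_quorum_mons` is made up for illustration, and `ceph quorum_status` is the more robust source when it is available.

```shell
#!/bin/sh
# Hypothetical helper: reads `ceph -s` text on stdin and prints the monitors
# listed after "out of quorum:", one per line.
# Assumption: the status line is formatted as in the output quoted above.
out_of_quorum_mons() {
    sed -n 's/.*out of quorum: *//p' | tr -d ' ' | tr ',' '\n'
}

# Example use against a live cluster (requires the ceph CLI and a working quorum):
#   ceph -s | out_of_quorum_mons
```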

systemctl status ceph-mon@2.service shows:

ceph-mon@2.service - Ceph cluster monitor daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Tue 2020-12-08 12:12:58 +03; 28s ago
  Process: 2681 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
 Main PID: 2681 (code=exited, status=1/FAILURE)

Dec 08 12:12:48 mon2 systemd[1]: Unit ceph-mon@2.service entered failed state.
Dec 08 12:12:48 mon2 systemd[1]: ceph-mon@2.service failed.
Dec 08 12:12:58 mon2 systemd[1]: ceph-mon@2.service holdoff time over, scheduling restart.
Dec 08 12:12:58 mon2 systemd[1]: Stopped Ceph cluster monitor daemon.
Dec 08 12:12:58 mon2 systemd[1]: start request repeated too quickly for ceph-mon@2.service
Dec 08 12:12:58 mon2 systemd[1]: Failed to start Ceph cluster monitor daemon.
Dec 08 12:12:58 mon2 systemd[1]: Unit ceph-mon@2.service entered failed state.
Dec 08 12:12:58 mon2 systemd[1]: ceph-mon@2.service failed.
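
For context, `Result: start-limit` together with "start request repeated too quickly" means systemd stopped retrying after several rapid failures, so further start requests are refused until the failed state is cleared. The following is a minimal sketch, not a confirmed fix, of clearing that state, trying one more start, and reading the unit's journal for the actual error. The helper names `run` and `retry_mon` and the `DRY_RUN` variable are assumptions made for illustration; the `systemctl` and `journalctl` subcommands are standard.

```shell
#!/bin/sh
# Hypothetical wrapper: with DRY_RUN=1 it only prints the command,
# otherwise it executes it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: $1 is the monitor id used in the unit name (2 in this post).
retry_mon() {
    unit="ceph-mon@$1.service"
    run systemctl reset-failed "$unit"          # clear the start-limit / failed state
    run systemctl start "$unit"                 # attempt a single start
    run journalctl -u "$unit" -n 50 --no-pager  # show the most recent log lines for the unit
}
```

Running `DRY_RUN=1 retry_mon 2` first prints the three commands without touching the service, which is a cautious way to review them before running them for real on mon2.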

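Restarting, stopping/starting, and enabling/disabling the monitor daemon did not work. The documentation mentions a monitor asok file in /var/run/ceph; I do not have one in the expected directory, although the other monitors have their asok files in place. I am now in a state where I cannot even stop the monitor daemon on the second monitor; it just stays in the failed state. Nothing appears in the monitor logs under /var/log/ceph. What should I do? I do not have much experience with Ceph, so I do not want to change anything without being absolutely sure, to avoid messing up the cluster.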

I tried starting the service manually on MON2:

/usr/bin/ceph-mon -f --cluster ceph --id 2 --setuser ceph --setgroup ceph