After retransmission failures between the nodes, both nodes mark each other as dead and no longer show each other's status in crm_mon

So crm_mon on node 1 does not show node 2, and likewise crm_mon on node 2 does not show node 1.

After analyzing the corosync logs, I found that both nodes had marked each other as dead following repeated retransmission failures. I tried stopping and starting corosync and pacemaker, but the nodes still did not form a cluster, and crm_mon on each node kept showing only that node itself.
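For reference, this is roughly how I restarted the stack on each node (a sketch assuming a RHEL 6-style init-script setup; adjust to systemctl on systemd hosts):

# run on each node; order matters, pacemaker depends on corosync
service pacemaker stop
service corosync stop
service corosync start
service pacemaker start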

Logs of Node 2:

For srv-vme-ccs-02

Oct 30 02:22:49 srv-vme-ccs-02 crmd[1973]: notice: crm_update_peer_state: plugin_handle_membership: Node srv-vme-ccs-01[2544637100] - state is now member (was (null))

At this point the node is still a member.

Oct 30 10:07:34 srv-vme-ccs-02 corosync[1613]: [TOTEM ] Retransmit List: 117
Oct 30 10:07:35 srv-vme-ccs-02 corosync[1613]: [TOTEM ] Retransmit List: 118
Oct 30 10:07:35 srv-vme-ccs-02 corosync[1613]: [TOTEM ] FAILED TO RECEIVE
Oct 30 10:07:49 srv-vme-ccs-02 arpwatch: bogon 192.168.0.120 d4:be:d9:af:c6:23
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 232: memb=1, new=0, lost=1
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [pcmk ] info: pcmk_peer_update: memb: srv-vme-ccs-02 2561414316
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [pcmk ] info: pcmk_peer_update: lost: srv-vme-ccs-01 2544637100
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 232: memb=1, new=0, lost=0
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [pcmk ] info: pcmk_peer_update: MEMB: srv-vme-ccs-02 2561414316
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [pcmk ] info: ais_mark_unseen_peer_dead: Node srv-vme-ccs-01 was not seen in the previous transition
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [pcmk ] info: update_member: Node 2544637100/srv-vme-ccs-01 is now: lost
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [pcmk ] info: send_member_notification: Sending membership update 232 to 2 children
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [CPG ] chosen downlist: sender r(0) ip(172.20.172.152) ; members(old:2 left:1)
Oct 30 10:07:59 srv-vme-ccs-02 crmd[1973]: notice: plugin_handle_membership: Membership 232: quorum lost
Oct 30 10:07:59 srv-vme-ccs-02 corosync[1613]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 30 10:07:59 srv-vme-ccs-02 cib[1968]: notice: plugin_handle_membership: Membership 232: quorum lost
Oct 30 10:07:59 srv-vme-ccs-02 crmd[1973]: notice: crm_update_peer_state: plugin_handle_membership: Node srv-vme-ccs-01[2544637100] - state is now lost (was member)
Oct 30 10:07:59 srv-vme-ccs-02 cib[1968]: notice: crm_update_peer_state: plugin_handle_membership: Node srv-vme-ccs-01[2544637100] - state is now lost (was member)
Oct 30 10:07:59 srv-vme-ccs-02 crmd[1973]: warning: reap_dead_nodes: Our DC node (srv-vme-ccs-01) left the cluster

At this point srv-vme-ccs-01 is no longer a member.

On the other node I found similar retransmission-failure messages in the logs.

Logs of Node 1:

For srv-vme-ccs-01

Oct 30 09:48:32 [2000] srv-vme-ccs-01 pengine: info: determine_online_status: Node srv-vme-ccs-01 is online
Oct 30 09:48:32 [2000] srv-vme-ccs-01 pengine: info: determine_online_status: Node srv-vme-ccs-02 is online

Oct 30 09:48:59 [2001] srv-vme-ccs-01 crmd: info: update_dc: Unset DC. Was srv-vme-ccs-01
Oct 30 09:48:59 corosync [TOTEM ] Retransmit List: 107 108 109 10a 10b 10c 10d 10e 10f 110 111 112 113 114 115 116 117
Oct 30 09:48:59 corosync [TOTEM ] Retransmit List: 107 108 109 10a 10b 10c 10d 10e 10f 110 111 112 113 114 115 116 117 118

Oct 30 10:08:22 corosync [TOTEM ] A processor failed, forming new configuration.
Oct 30 10:08:25 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 232: memb=1, new=0, lost=1
Oct 30 10:08:25 corosync [pcmk ] info: pcmk_peer_update: memb: srv-vme-ccs-01 2544637100
Oct 30 10:08:25 corosync [pcmk ] info: pcmk_peer_update: lost: srv-vme-ccs-02 2561414316
Oct 30 10:08:25 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 232: memb=1, new=0, lost=0
Oct 30 10:08:25 corosync [pcmk ] info: pcmk_peer_update: MEMB: srv-vme-ccs-01 2544637100
Oct 30 10:08:25 corosync [pcmk ] info: ais_mark_unseen_peer_dead: Node srv-vme-ccs-02 was not seen in the previous transition
Oct 30 10:08:25 corosync [pcmk ] info: update_member: Node 2561414316/srv-vme-ccs-02 is now: lost
Oct 30 10:08:25 corosync [pcmk ] info: send_member_notification: Sending membership update 232 to 2 children
Oct 30 10:08:25 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 30 10:08:25 [1996] srv-vme-ccs-01 cib: notice: plugin_handle_membership: Membership 232: quorum lost
Oct 30 10:08:25 [1996] srv-vme-ccs-01 cib: notice: crm_update_peer_state: plugin_handle_membership: Node srv-vme-ccs-02[2561414316] - state is now lost (was member)
Oct 30 10:08:25 corosync [CPG ] chosen downlist: sender r(0) ip(172.20.172.151) ; members(old:2 left:1)
Oct 30 10:08:25 [2001] srv-vme-ccs-01 crmd: notice: plugin_handle_membership: Membership 232: quorum lost
Oct 30 10:08:25 [2001] srv-vme-ccs-01 crmd: notice: crm_update_peer_state: plugin_handle_membership: Node srv-vme-ccs-02[2561414316] - state is now lost (was member)
Oct 30 10:08:25 [2001] srv-vme-ccs-01 crmd: info: peer_update_callback: srv-vme-ccs-02 is now lost (was member)
Oct 30 10:08:25 corosync [MAIN ] Completed service synchronization, ready to provide service.
Oct 30 10:08:25 [2001] srv-vme-ccs-01 crmd: warning: match_down_event: No match for shutdown action on srv-vme-ccs-02
Oct 30 10:08:25 [1990] srv-vme-ccs-01 pacemakerd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=9): Try again (6)

Oct 30 10:08:25 [2001] srv-vme-ccs-01 crmd: info: join_make_offer: Skipping srv-vme-ccs-01: already known 1
Oct 30 10:08:25 [2001] srv-vme-ccs-01 crmd: info: update_dc: Set DC to srv-vme-ccs-01 (3.0.7)
Oct 30 10:08:25 [1996] srv-vme-ccs-01 cib: info: cib_process_request: Completed cib_modify operation for section crm_config: OK (rc=0, origin=local/crmd/185, version=0.116.3)

So heavy message retransmission was happening on both nodes at the same time (it started after a sudden server reboot), both nodes marked each other as lost members, and each formed a separate one-node cluster with itself as DC.

Here is the solution I found:

First, I checked with tcpdump and saw that pacemaker (via corosync) was using multicast. After investigating with the network team, we learned that multicast was not enabled on the network.
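A capture along these lines shows whether corosync traffic is multicast (a sketch; eth0 and port 5405 are assumptions, the corosync default mcastport, so check the interface and mcastport in your corosync.conf):

# Multicast mode shows a 239.x.x.x destination address; udpu shows unicast peer IPs
tcpdump -n -i eth0 udp port 5405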

So we removed the mcastaddr line and restarted corosync and pacemaker, but corosync refused to start with an error along the lines of:

no mcastaddr defined in corosync.conf
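For context, a multicast-mode totem section looks roughly like this (a sketch; the bindnetaddr matches the subnet seen in the logs, while the mcastaddr value here is just a common example):

totem {
    version: 2
    interface {
        ringnumber: 0
        bindnetaddr: 172.20.172.0
        mcastaddr: 239.255.1.1    # removing this while corosync is still in
        mcastport: 5405           # multicast mode is what makes it refuse to start
    }
}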

In a final round of debugging I found that the syntax of the transport option was incorrect. It should be:

transport: udpu

but it had been written as:

transport=udpu

Because the malformed line was being ignored, corosync was falling back to its default transport, multicast mode.
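The corrected totem section, sketched for corosync 1.x with udpu (the [pcmk ] plugin messages above suggest that version; member addresses are taken from the CPG lines in the logs):

totem {
    version: 2
    transport: udpu               # colon syntax, not transport=udpu
    interface {
        ringnumber: 0
        bindnetaddr: 172.20.172.0
        member {
            memberaddr: 172.20.172.151   # srv-vme-ccs-01
        }
        member {
            memberaddr: 172.20.172.152   # srv-vme-ccs-02
        }
    }
}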

So the problem was solved after correcting corosync.conf.
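To confirm the fix, the standard tools can be run on each node after restarting the stack:

corosync-cfgtool -s   # ring status should report no faults
crm_mon -1            # one-shot cluster status; both nodes should show as online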