Pacemaker not able to start slave node on postgres-11

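I have 2 nodes (named node03 and node04) in a master-slave, hot-standby setup, with Pacemaker managing the cluster. Before the switchover, node04 was the master and node03 was the standby. Since the switchover, I have been trying to bring node04 back as the slave node, but I have not been able to.

During the switchover, I realized that someone had changed the configuration file and set the ignore_system_indexes parameter to true. I had to remove it and restart the postgres server manually. It was after this that the cluster started behaving strangely.

I can bring node04 back as a slave node manually, i.e. if I start the postgres instance by hand with the recovery.conf file in place.

The following files should help in understanding the situation: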

sudo crm_mon -A1f
Stack: corosync
Current DC: node03 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum

Node node04: standby
Online: [ node03 ]

Active resources:

 Resource Group: master-group
     vip-repli  (ocf::heartbeat:IPaddr2):       Started node03
     vip-master (ocf::heartbeat:IPaddr2):       Started node03
 Master/Slave Set: pgsql-cluster [pgsqlins]
     Masters: [ node03 ]

Node Attributes:
* Node node03:
    + master-pgsqlins                   : 1000
    + pgsqlins-data-status              : LATEST
    + pgsqlins-master-baseline          : 00008820DC000098
    + pgsqlins-status                   : PRI
* Node node04:
    + master-pgsqlins                   : -INFINITY
    + pgsqlins-data-status              : DISCONNECT
    + pgsqlins-status                   : STOP

Migration Summary:
* Node node03:
* Node node04:
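
When comparing the two nodes it can help to pull a single attribute out of this output. The following is a minimal sketch, assuming the "* Node NAME:" / "+ attribute : value" layout shown above; the function name node_attr is made up for illustration and is not part of Pacemaker.

```shell
#!/bin/sh
# Print "<node> <value>" for one attribute, reading `crm_mon -A1f` text on stdin.
# Assumes the "* Node NAME:" header and "+ attr : value" lines shown above.
node_attr() {
    awk -v want="$1" '
        /^\* Node / { node = $3; sub(/:$/, "", node); next }   # remember current node
        $1 == "+" && $2 == want { print node, $NF }            # last field is the value
    '
}

# Usage on a live cluster (not run here):
#   sudo crm_mon -A1f | node_attr pgsqlins-status
```

Against the output above, this would list node03 with PRI and node04 with STOP for pgsqlins-status.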

recovery.conf

primary_conninfo = 'host=1xx.xx.xx.xx port=5432 user=replica application_name=node04 keepalives_idle=60 keepalives_interval=5 keepalives_count=5'
restore_command = 'rsync -a /Dxxxxx1/wal_archive/%f %p'
recovery_target_timeline = 'latest'
standby_mode = 'on'
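
For reference, the manual start that does work can be sketched roughly as below. This is an illustration only: the function name write_recovery_conf and its arguments are hypothetical, the host and archive path are placeholders for the redacted values, and the pg_ctl path is the one from the pgctl parameter in the CIB further down.

```shell
#!/bin/sh
# Write a PostgreSQL 11 recovery.conf into the data directory $1.
# $2 = primary host, $3 = application_name, $4 = WAL archive directory.
# The settings mirror the recovery.conf shown above.
write_recovery_conf() {
    cat > "$1/recovery.conf" <<EOF
standby_mode = 'on'
primary_conninfo = 'host=$2 port=5432 user=replica application_name=$3 keepalives_idle=60 keepalives_interval=5 keepalives_count=5'
restore_command = 'rsync -a $4/%f %p'
recovery_target_timeline = 'latest'
EOF
}

# Manual start on node04 (not run here; PGDATA and PRIMARY_IP are placeholders):
#   write_recovery_conf "$PGDATA" "$PRIMARY_IP" node04 /path/to/wal_archive
#   /usr/pgsql-11/bin/pg_ctl -D "$PGDATA" start
```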

Cluster CIB

sudo pcs cluster cib
<cib crm_feature_set="3.0.14" validate-with="pacemaker-2.10" epoch="269" num_updates="4" admin_epoch="0" cib-last-written="Mon Jun 28 15:13:35 2021" update-origin="node04" update-client="crmd" update-user="hacluster" have-quorum="1" dc-uuid="1">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
        <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
        <nvpair id="cib-bootstrap-options-have-watchdog" name="have-watchdog" value="false"/>
        <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.23-1.el7_9.1-9acf116022"/>
        <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
        <nvpair id="cib-bootstrap-options-cluster-name" name="cluster-name" value="pgcluster"/>
        <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1624860815"/>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="1" uname="node03">
        <instance_attributes id="nodes-1">
          <nvpair id="nodes-1-pgsqlins-data-status" name="pgsqlins-data-status" value="LATEST"/>
        </instance_attributes>
      </node>
      <node id="2" uname="node04">
        <instance_attributes id="nodes-2">
          <nvpair id="nodes-2-pgsqlins-data-status" name="pgsqlins-data-status" value="DISCONNECT"/>
          <nvpair id="nodes-2-standby" name="standby" value="on"/>
        </instance_attributes>
      </node>
    </nodes>
    <resources>
      <group id="master-group">
        <primitive class="ocf" id="vip-repli" provider="heartbeat" type="IPaddr2">
          <instance_attributes id="vip-repli-instance_attributes">
            <nvpair id="vip-repli-instance_attributes-cidr_netmask" name="cidr_netmask" value="24"/>
            <nvpair id="vip-repli-instance_attributes-ip" name="ip" value="1xx.xx.xx.xx"/>
            <nvpair id="vip-repli-instance_attributes-nic" name="nic" value="eth2"/>
          </instance_attributes>
          <operations>
            <op id="vip-repli-monitor-interval-10s" interval="10s" name="monitor" timeout="20s"/>
            <op id="vip-repli-start-interval-0s" interval="0s" name="start" timeout="20s"/>
            <op id="vip-repli-stop-interval-0s" interval="0s" name="stop" timeout="20s"/>
          </operations>
        </primitive>
        <primitive class="ocf" id="vip-master" provider="heartbeat" type="IPaddr2">
          <instance_attributes id="vip-master-instance_attributes">
            <nvpair id="vip-master-instance_attributes-cidr_netmask" name="cidr_netmask" value="24"/>
            <nvpair id="vip-master-instance_attributes-ip" name="ip" value="1x.xx.xxx.xxx"/>
            <nvpair id="vip-master-instance_attributes-nic" name="nic" value="eth1"/>
          </instance_attributes>
          <operations>
            <op id="vip-master-monitor-interval-10s" interval="10s" name="monitor" timeout="20s"/>
            <op id="vip-master-start-interval-0s" interval="0s" name="start" timeout="20s"/>
            <op id="vip-master-stop-interval-0s" interval="0s" name="stop" timeout="20s"/>
          </operations>
        </primitive>
      </group>
      <master id="pgsql-cluster">
        <primitive class="ocf" id="pgsqlins" provider="heartbeat" type="pgsql11">
          <instance_attributes id="pgsqlins-instance_attributes">
            <nvpair id="pgsqlins-instance_attributes-master_ip" name="master_ip" value="1xx.xx.xx.xx"/>
            <nvpair id="pgsqlins-instance_attributes-node_list" name="node_list" value="node03 node04"/>
            <nvpair id="pgsqlins-instance_attributes-pgctl" name="pgctl" value="/usr/pgsql-11/bin/pg_ctl"/>
            <nvpair id="pgsqlins-instance_attributes-pgdata" name="pgdata" value="/DPxxxx01/datadg/data"/>
            <nvpair id="pgsqlins-instance_attributes-pgport" name="pgport" value="5432"/>
            <nvpair id="pgsqlins-instance_attributes-primary_conninfo_opt" name="primary_conninfo_opt" value="keepalives_idle=60 keepalives_interval=5 keepalives_count=5"/>
            <nvpair id="pgsqlins-instance_attributes-psql" name="psql" value="/usr/pgsql-11/bin/psql"/>
            <nvpair id="pgsqlins-instance_attributes-rep_mode" name="rep_mode" value="sync"/>
            <nvpair id="pgsqlins-instance_attributes-repuser" name="repuser" value="replica"/>
            <nvpair id="pgsqlins-instance_attributes-restart_on_promote" name="restart_on_promote" value="true"/>
            <nvpair id="pgsqlins-instance_attributes-restore_command" name="restore_command" value="rsync -a /Dxxxxx01/wal_archive/%f %p"/>
          </instance_attributes>
          <operations>
            <op id="pgsqlins-demote-interval-0" interval="0" name="demote" on-fail="stop" timeout="60s"/>
            <op id="pgsqlins-methods-interval-0s" interval="0s" name="methods" timeout="5s"/>
            <op id="pgsqlins-monitor-interval-10s" interval="10s" name="monitor" on-fail="restart" timeout="60s"/>
            <op id="pgsqlins-monitor-interval-9s" interval="9s" name="monitor" on-fail="restart" role="Master" timeout="60s"/>
            <op id="pgsqlins-notify-interval-0" interval="0" name="notify" timeout="60s"/>
            <op id="pgsqlins-promote-interval-0" interval="0" name="promote" on-fail="restart" timeout="60s"/>
            <op id="pgsqlins-start-interval-0" interval="0" name="start" on-fail="restart" timeout="60s"/>
            <op id="pgsqlins-stop-interval-0" interval="0" name="stop" on-fail="block" timeout="60s"/>
          </operations>
        </primitive>
        <meta_attributes id="pgsql-cluster-meta_attributes">
          <nvpair id="pgsql-cluster-meta_attributes-master-node-max" name="master-node-max" value="1"/>
          <nvpair id="pgsql-cluster-meta_attributes-clone-max" name="clone-max" value="2"/>
          <nvpair id="pgsql-cluster-meta_attributes-notify" name="notify" value="true"/>
          <nvpair id="pgsql-cluster-meta_attributes-master-max" name="master-max" value="1"/>
          <nvpair id="pgsql-cluster-meta_attributes-clone-node-max" name="clone-node-max" value="1"/>
        </meta_attributes>
      </master>
    </resources>
    <constraints>
      <rsc_colocation id="colocation-master-group-pgsql-cluster-INFINITY" rsc="master-group" score="INFINITY" with-rsc="pgsql-cluster" with-rsc-role="Master"/>
      <rsc_order first="pgsql-cluster" first-action="promote" id="order-pgsql-cluster-master-group-INFINITY" score="INFINITY" symmetrical="false" then="master-group" then-action="start"/>
      <rsc_order first="pgsql-cluster" first-action="demote" id="order-pgsql-cluster-master-group-0" score="0" symmetrical="false" then="master-group" then-action="stop"/>
      <rsc_location id="cli-prefer-pgsql-cluster" rsc="pgsql-cluster" role="Started" node="node04" score="INFINITY"/>
    </constraints>
  </configuration>
  <status>
    <node_state id="1" uname="node03" in_ccm="true" crmd="online" crm-debug-origin="do_update_resource" join="member" expected="member">
      <transient_attributes id="1">
        <instance_attributes id="status-1">
          <nvpair id="status-1-pgsqlins-status" name="pgsqlins-status" value="PRI"/>
          <nvpair id="status-1-master-pgsqlins" name="master-pgsqlins" value="1000"/>
          <nvpair id="status-1-pgsqlins-master-baseline" name="pgsqlins-master-baseline" value="00008820DC000098"/>
        </instance_attributes>
      </transient_attributes>
      <lrm id="1">
        <lrm_resources>
          <lrm_resource id="vip-master" type="IPaddr2" class="ocf" provider="heartbeat">
            <lrm_rsc_op id="vip-master_last_0" operation_key="vip-master_start_0" operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14" transition-key="3:433:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" transition-magic="0:0;3:433:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" exit-reason="" on_node="node03" call-id="535" rc-code="0" op-status="0" interval="0" last-run="1624859077" last-rc-change="1624859077" exec-time="90" queue-time="0" op-digest="38fc1b2633211138e53cb349a5c147ff"/>
            <lrm_rsc_op id="vip-master_monitor_10000" operation_key="vip-master_monitor_10000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14" transition-key="4:433:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" transition-magic="0:0;4:433:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" exit-reason="" on_node="node03" call-id="536" rc-code="0" op-status="0" interval="10000" last-rc-change="1624859077" exec-time="72" queue-time="0" op-digest="4cbf56ab9e52c6f07a7be8cbb786451c"/>
          </lrm_resource>
          <lrm_resource id="vip-repli" type="IPaddr2" class="ocf" provider="heartbeat">
            <lrm_rsc_op id="vip-repli_last_0" operation_key="vip-repli_start_0" operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14" transition-key="1:433:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" transition-magic="0:0;1:433:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" exit-reason="" on_node="node03" call-id="532" rc-code="0" op-status="0" interval="0" last-run="1624859077" last-rc-change="1624859077" exec-time="127" queue-time="0" op-digest="dd04ed3322c75b7bab13c5bea56dbe77"/>
            <lrm_rsc_op id="vip-repli_monitor_10000" operation_key="vip-repli_monitor_10000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14" transition-key="2:433:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" transition-magic="0:0;2:433:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" exit-reason="" on_node="node03" call-id="534" rc-code="0" op-status="0" interval="10000" last-rc-change="1624859077" exec-time="55" queue-time="0" op-digest="c76770c29a91fb082fdf1fdd8b0469c3"/>
          </lrm_resource>
          <lrm_resource id="pgsqlins" type="pgsql11" class="ocf" provider="heartbeat">
            <lrm_rsc_op id="pgsqlins_last_0" operation_key="pgsqlins_promote_0" operation="promote" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14" transition-key="12:432:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" transition-magic="0:0;12:432:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" exit-reason="" on_node="node03" call-id="530" rc-code="0" op-status="0" interval="0" last-run="1624859073" last-rc-change="1624859073" exec-time="3307" queue-time="0" op-digest="2f51441ed087061eb68745fd8157ddb6"/>
            <lrm_rsc_op id="pgsqlins_monitor_9000" operation_key="pgsqlins_monitor_9000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14" transition-key="13:433:8:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" transition-magic="0:8;13:433:8:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" exit-reason="" on_node="node03" call-id="533" rc-code="8" op-status="0" interval="9000" last-rc-change="1624859078" exec-time="497" queue-time="1" op-digest="978aa48a7da35944c793e174dbee9a1d"/>
          </lrm_resource>
        </lrm_resources>
      </lrm>
    </node_state>
    <node_state id="2" uname="node04" in_ccm="true" crmd="online" crm-debug-origin="do_update_resource" join="member" expected="member">
      <lrm id="2">
        <lrm_resources>
          <lrm_resource id="vip-repli" type="IPaddr2" class="ocf" provider="heartbeat">
            <lrm_rsc_op id="vip-repli_last_0" operation_key="vip-repli_monitor_0" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14" transition-key="4:1:7:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" transition-magic="0:7;4:1:7:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" exit-reason="" on_node="node04" call-id="5" rc-code="7" op-status="0" interval="0" last-run="1624600624" last-rc-change="1624600624" exec-time="65" queue-time="0" op-digest="dd04ed3322c75b7bab13c5bea56dbe77"/>
          </lrm_resource>
          <lrm_resource id="vip-master" type="IPaddr2" class="ocf" provider="heartbeat">
            <lrm_rsc_op id="vip-master_last_0" operation_key="vip-master_monitor_0" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14" transition-key="5:1:7:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" transition-magic="0:7;5:1:7:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" exit-reason="" on_node="node04" call-id="9" rc-code="7" op-status="0" interval="0" last-run="1624600624" last-rc-change="1624600624" exec-time="62" queue-time="0" op-digest="38fc1b2633211138e53cb349a5c147ff"/>
          </lrm_resource>
          <lrm_resource id="pgsqlins" type="pgsql11" class="ocf" provider="heartbeat">
            <lrm_rsc_op id="pgsqlins_last_0" operation_key="pgsqlins_monitor_0" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14" transition-key="4:436:7:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" transition-magic="0:7;4:436:7:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" exit-reason="" on_node="node04" call-id="192" rc-code="7" op-status="0" interval="0" last-run="1624860816" last-rc-change="1624860816" exec-time="178" queue-time="0" op-digest="2f51441ed087061eb68745fd8157ddb6"/>
          </lrm_resource>
        </lrm_resources>
      </lrm>
      <transient_attributes id="2">
        <instance_attributes id="status-2">
          <nvpair id="status-2-pgsqlins-status" name="pgsqlins-status" value="STOP"/>
          <nvpair id="status-2-master-pgsqlins" name="master-pgsqlins" value="-INFINITY"/>
        </instance_attributes>
      </transient_attributes>
    </node_state>
  </status>
</cib>

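If I try to unstandby node04, Pacemaker first demotes node03 and then tries to start node04, but node04 never comes up. I also tried bringing up node04 on its own, and that fails as well. However, if I start node04 manually from the state shown above, it works. If I try to clean up the pgsqlins resource, that fails too.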
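
The exact commands are not quoted in the post; the sequence probably looked something like the sketch below. It assumes the pcs 0.9.x syntax shipped with EL7 (`pcs cluster unstandby`, `pcs resource cleanup`), and the run and rejoin_node helpers are invented here so the steps can be previewed without touching the cluster.

```shell
#!/bin/sh
# Echo each command by default; set DRY_RUN=0 (as root) to actually execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1 = node name, $2 = resource id
rejoin_node() {
    node=$1; rsc=$2
    run pcs cluster unstandby "$node"   # take the node out of standby
    run pcs resource cleanup "$rsc"     # clear the resource's failure history
    run crm_mon -A1f                    # print the resulting cluster state once
}

# Usage (dry run, only prints the commands):
#   rejoin_node node04 pgsqlins
```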

Here is the corosync.log:

8 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_process_request:  Forwarding cib_apply_diff operation for section 'all' to all (origin=local/cibadmin/2)
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       Diff: --- 0.251.32 2
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       Diff: +++ 0.252.0 b956759712580c1bfdffd25cbf4ab8e9
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       -- /cib/configuration/nodes/node[@id='2']/instance_attributes[@id='nodes-2']/nvpair[@id='nodes-2-standby']
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       +  /cib:  @epoch=252, @num_updates=0
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_process_request:  Completed cib_apply_diff operation for section 'all': OK (rc=0, origin=dci2pgs04/cibadmin/2, version=0.252.0)
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_file_backup:      Archived previous version as /var/lib/pacemaker/cib/cib-60.raw
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_file_write_with_digest:   Wrote version 0.252.0 of the CIB to disk (digest: 8b99629d323c923de592700bc4398c49)
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_file_write_with_digest:   Reading cluster configuration file /var/lib/pacemaker/cib/cib.ZtvQXP (digest: /var/lib/pacemaker/cib/cib.fh4Toy)
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       Diff: --- 0.252.0 2
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       Diff: +++ 0.252.1 (null)
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       +  /cib:  @num_updates=1
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       +  /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='pgsqlins']/lrm_rsc_op[@id='pgsqlins_last_0']:  @operation_key=pgsqlins_demote_0, @operation=demote, @transition-key=10:396:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04, @transition-magic=-1:193;10:396:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04, @call-id=-1, @rc-code=193, @op-status=-1, @last-run=1624852894, @last-rc-change=1624852894, @exec-time=0
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_process_request:  Completed cib_modify operation for section status: OK (rc=0, origin=node03/crmd/948, version=0.252.1)
Jun 28 13:01:34 [9294] node04.dc.japannext.co.jp      attrd:     info: attrd_peer_update:    Setting master-pgsqlins[node03]: 1000 -> -INFINITY from node03
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       Diff: --- 0.252.1 2
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       Diff: +++ 0.252.2 (null)
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       +  /cib:  @num_updates=2
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       +  /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-master-pgsqlins']:  @value=-INFINITY
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_process_request:  Completed cib_modify operation for section status: OK (rc=0, origin=node03/attrd/211, version=0.252.2)
Jun 28 13:01:34 [9294] node04.dc.japannext.co.jp      attrd:     info: attrd_peer_update:    Setting pgsqlins-master-baseline[node03]: 00008820CC000098 -> (null) from node03
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       Diff: --- 0.252.2 2
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       Diff: +++ 0.252.3 (null)
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       -- /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-pgsqlins-master-baseline']
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       +  /cib:  @num_updates=3
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_process_request:  Completed cib_modify operation for section status: OK (rc=0, origin=node03/attrd/212, version=0.252.3)
Jun 28 13:01:35 [9294] node04.dc.japannext.co.jp      attrd:     info: attrd_peer_update:    Setting pgsqlins-status[node03]: PRI -> STOP from node03
.
.
.
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       +  /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='pgsqlins']/lrm_rsc_op[@id='pgsqlins_last_0']:  @transition-magic=0:0;9:397:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04, @call-id=445, @rc-code=0, @op-status=0, @exec-time=471
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_process_request:  Completed cib_modify operation for section status: OK (rc=0, origin=node03/crmd/956, version=0.252.11)
Jun 28 13:01:36 [9296] node04.dc.japannext.co.jp       crmd:     info: do_lrm_rsc_op:        Performing key=10:397:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04 op=pgsqlins_start_0
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_process_request:  Forwarding cib_modify operation for section status to all (origin=local/crmd/142)
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       Diff: --- 0.252.11 2
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       Diff: +++ 0.252.12 (null)
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       +  /cib:  @num_updates=12
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       +  /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='pgsqlins']/lrm_rsc_op[@id='pgsqlins_last_0']:  @operation_key=pgsqlins_start_0, @operation=start, @transition-key=12:397:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04, @transition-magic=-1:193;12:397:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04, @call-id=-1, @rc-code=193, @op-status=-1, @exec-time=0
Jun 28 13:01:36 [9293] node04.dc.japannext.co.jp       lrmd:     info: log_execute:  executing - rsc:pgsqlins action:start call_id:132
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_process_request:  Completed cib_modify operation for section status: OK (rc=0, origin=node03/crmd/957, version=0.252.12)
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       Diff: --- 0.252.12 2
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       Diff: +++ 0.252.13 (null)
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       +  /cib:  @num_updates=13
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       +  /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='pgsqlins']/lrm_rsc_op[@id='pgsqlins_last_0']:  @operation_key=pgsqlins_start_0, @operation=start, @transition-key=10:397:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04, @transition-magic=-1:193;10:397:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04, @call-id=-1, @rc-code=193, @op-status=-1, @last-run=1624852896, @last-rc-change=1624852896, @exec-time=0
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_process_request:  Completed cib_modify operation for section status: OK (rc=0, origin=node04/crmd/142, version=0.252.13)
Jun 28 13:01:37  pgsql11(pgsqlins)[9613]:    INFO: Set all nodes into async mode.
Jun 28 13:01:37  pgsql11(pgsqlins)[9613]:    INFO: PostgreSQL is down
Jun 28 13:01:37  pgsql11(pgsqlins)[9613]:    INFO: server starting
Jun 28 13:01:37  pgsql11(pgsqlins)[9613]:    INFO: PostgreSQL start command sent.
Jun 28 13:01:37  pgsql11(pgsqlins)[9613]:    WARNING: Can't get PostgreSQL recovery status. rc=2
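
Most of this log is CIB bookkeeping; the lines written by the resource agent itself are the ones tagged pgsql11(pgsqlins). A small filter like the one below makes them easier to spot. It is a sketch: the ra_lines helper is hypothetical, and the log path in the usage comment is an assumption that depends on the corosync logging configuration.

```shell
#!/bin/sh
# Print only the lines logged by one resource agent instance, e.g. "pgsql11(pgsqlins)".
# Reads log text on stdin; -F makes grep treat the tag as a fixed string, not a regex.
ra_lines() {
    grep -F "$1($2)"
}

# Usage (log path is an assumption, adjust to the local logfile setting):
#   ra_lines pgsql11 pgsqlins < /var/log/cluster/corosync.log
```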

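My guess is that Pacemaker is reading settings from before the switchover out of /var/lib/pacemaker/cib and using them to carry out these steps. Any help on how to reset this would be much appreciated.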

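  • As mentioned in the question, when node04 was taken out of standby, Pacemaker demoted node03 and tried to make node04 the master. It would fail at that and then bring node03 back up as a standalone master.

  • Because I suspected it was picking up some old configuration from the cib/pengine folders, I even destroyed the cluster on both nodes, removed pacemaker, pcs and corosync, and reinstalled all of them.

  • Even so, the problem persisted. I then suspected that the folder permissions on /var/lib/pgsql/ on node04 might be wrong, and started looking around there.

  • That is when I found an old PGSQL.lock.bak file dated June 11, which means it was older than the current PGSQL.lock file. Because of it, Pacemaker kept trying to promote node04 and kept failing. Pacemaker does not report this as an error in any log, and there is nothing about it in the crm_mon output either. Once I removed this file, it worked like a charm.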

TL;DR

  • Check whether there is a PGSQL.lock.bak or any other unwanted file in the /var/lib/pgsql/tmp folder, and remove it before starting Pacemaker again.
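
A quick way to do that check is sketched below. The find_stale_locks name is made up for illustration, and the pattern only catches leftovers whose names start with PGSQL.lock (such as PGSQL.lock.bak); any other unwanted files still need a manual look. It only lists files and deletes nothing.

```shell
#!/bin/sh
# List regular files directly inside directory $1 whose names start with
# "PGSQL.lock" but are longer than that (for example PGSQL.lock.bak).
# A plain PGSQL.lock is deliberately not matched. Nothing is deleted.
find_stale_locks() {
    find "$1" -maxdepth 1 -type f -name 'PGSQL.lock?*'
}

# Usage (not run here):
#   find_stale_locks /var/lib/pgsql/tmp
#   # after reviewing the list, with Pacemaker stopped on this node:
#   #   sudo rm /var/lib/pgsql/tmp/PGSQL.lock.bak
```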