SQL 2017 Linux AG resource not failing over with Pacemaker
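We set up a SQL Server 2017 cluster on Linux following the Microsoft documentation. Replication in the AG works fine, but we cannot fail over. When I look at the logs during a failover, Pacemaker tries to move the AG, but the move fails and the AG keeps running on the primary.
On the master, it reports that the resource is not running: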
Oct 01 15:06:56 [4346] syncdb01a-stag lrmd: notice: operation_finished: ttsyncagresource_monitor_11000:6280:stderr [ resource ttsyncagresource is NOT running ]
On the secondary, I see this unknown error:
Oct 01 15:06:57 [24662] syncdb01b-stag pengine: warning: unpack_rsc_op_failure: Processing failed start of ttsyncagresource:1 on syncdb01b-stag: unknown error | rc=1
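If I run pcs status, I get the following. The most recent error shown is what happens when I shut down the primary node. The other two errors were caused by SQL permissions and have since been resolved.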
[root@syncdb01a-stag oper]# pcs status
Cluster name: syncdb-stag
Stack: corosync
Current DC: syncdb01b-stag (version 1.1.20-5.el7_7.1-3c4c782f70) - partition with quorum
Last updated: Tue Oct 1 20:36:32 2019
Last change: Tue Oct 1 15:53:57 2019 by root via crm_resource on syncdb01a-stag
3 nodes configured
3 resources configured
Online: [ syncdb01a-stag syncdb01b-stag syncwit01-stag ]
Full list of resources:
Master/Slave Set: ttsyncagresource-master [ttsyncagresource]
Masters: [ syncdb01a-stag ]
Stopped: [ syncdb01b-stag syncwit01-stag ]
Failed Resource Actions:
* ttsyncagresource_monitor_11000 on syncdb01a-stag 'not running' (7): call=17, status=complete, exitreason='',
last-rc-change='Tue Oct 1 15:03:47 2019', queued=0ms, exec=0ms
* ttsyncagresource_start_0 on syncdb01b-stag 'unknown error' (1): call=17, status=complete, exitreason='2019/10/01 14:43:30 Did not find AG row in sys.availability_groups',
last-rc-change='Tue Oct 1 14:43:25 2019', queued=0ms, exec=5255ms
* ttsyncagresource_start_0 on syncwit01-stag 'unknown error' (1): call=17, status=complete, exitreason='2019/10/01 14:43:30 Did not find AG row in sys.availability_groups',
last-rc-change='Tue Oct 1 14:43:25 2019', queued=1ms, exec=5228ms
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
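The Failed Resource Actions section above still lists start failures on syncdb01b-stag and syncwit01-stag even though the permission problem was fixed. With start-failure-is-fatal: true (as in the pcs config output further down), Pacemaker normally does not retry a resource on a node after a failed start until that failure is cleared, for example with pcs resource cleanup, so stale entries like these may be worth checking. The following is a minimal sketch, not an official tool, that extracts those entries from pcs status text. The function names and the regular expression are assumptions written against the Pacemaker 1.1.x output format shown here.

```python
import re

# Matches the first line of each entry under "Failed Resource Actions:",
# in the format shown in the pcs status output above (assumed stable
# only for this Pacemaker 1.1.x layout).
FAILED_ACTION = re.compile(
    r"\*\s+(?P<action>\S+) on (?P<node>\S+) '(?P<reason>[^']*)' \((?P<rc>\d+)\)"
)


def failed_actions(status_text):
    """Return one dict per failed action found in pcs status text."""
    results = []
    for line in status_text.splitlines():
        match = FAILED_ACTION.search(line)
        if match:
            entry = match.groupdict()
            entry["rc"] = int(entry["rc"])
            results.append(entry)
    return results


def nodes_with_failed_start(status_text):
    """Return the sorted node names that have a recorded start failure."""
    return sorted(
        {e["node"] for e in failed_actions(status_text)
         if e["action"].endswith("_start_0")}
    )


# Excerpt copied from the pcs status output in the question.
SAMPLE_STATUS = """\
Failed Resource Actions:
* ttsyncagresource_monitor_11000 on syncdb01a-stag 'not running' (7): call=17, status=complete, exitreason='',
last-rc-change='Tue Oct 1 15:03:47 2019', queued=0ms, exec=0ms
* ttsyncagresource_start_0 on syncdb01b-stag 'unknown error' (1): call=17, status=complete, exitreason='2019/10/01 14:43:30 Did not find AG row in sys.availability_groups',
last-rc-change='Tue Oct 1 14:43:25 2019', queued=0ms, exec=5255ms
* ttsyncagresource_start_0 on syncwit01-stag 'unknown error' (1): call=17, status=complete, exitreason='2019/10/01 14:43:30 Did not find AG row in sys.availability_groups',
last-rc-change='Tue Oct 1 14:43:25 2019', queued=1ms, exec=5228ms
"""

if __name__ == "__main__":
    for node in nodes_with_failed_start(SAMPLE_STATUS):
        print("start failure still recorded on", node)
```

Run against the excerpt, this reports the two nodes that pcs status also shows as Stopped, which is consistent with, though not proof of, the old start failures blocking the failover target.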
I have also removed all constraints (because this is a multi-subnet setup, we are not using a virtual IP):
[root@syncdb01a-stag oper]# pcs constraint
Location Constraints:
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:
Here is the output of pcs config:
[root@syncdb01a-stag oper]# pcs config
Cluster Name: syncdb-stag
Corosync Nodes:
syncdb01a-stag syncdb01b-stag syncwit01-stag
Pacemaker Nodes:
syncdb01a-stag syncdb01b-stag syncwit01-stag
Resources:
Master: ttsyncagresource-master
Meta Attrs: notify=true
Resource: ttsyncagresource (class=ocf provider=mssql type=ag)
Attributes: ag_name=ttsyncag
Meta Attrs: failure=timeout=60s notify=true
Operations: demote interval=0s timeout=10 (ttsyncagresource-demote-interval-0s)
monitor interval=10 timeout=60 (ttsyncagresource-monitor-interval-10)
monitor interval=11 role=Master timeout=60 (ttsyncagresource-monitor-interval-11)
monitor interval=12 role=Slave timeout=60 (ttsyncagresource-monitor-interval-12)
notify interval=0s timeout=60 (ttsyncagresource-notify-interval-0s)
promote interval=0s timeout=60 (ttsyncagresource-promote-interval-0s)
start interval=0s timeout=60 (ttsyncagresource-start-interval-0s)
stop interval=0s timeout=10 (ttsyncagresource-stop-interval-0s)
Stonith Devices:
Fencing Levels:
Location Constraints:
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:
Alerts:
No alerts defined
Resources Defaults:
No defaults set
Operations Defaults:
No defaults set
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: syncdb-stag
cluster-recheck-interval: 2min
dc-version: 1.1.20-5.el7_7.1-3c4c782f70
have-watchdog: false
start-failure-is-fatal: true
stonith-enabled: false
Quorum:
Options:
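One detail in the output above may be worth a second look: the resource's Meta Attrs line reads failure=timeout=60s, whereas the Pacemaker meta attribute is named failure-timeout. If that is a typo from the create command, the intended 60s failure expiry would not be in effect. This is a hedged sketch, with a hypothetical helper name, that flags meta attribute tokens whose value itself contains an equals sign. It only inspects text in the pcs config layout shown here and does not validate attribute names against Pacemaker.

```python
def suspicious_meta_attrs(config_text):
    """Return Meta Attrs tokens whose value contains '=' (likely typos)."""
    prefix = "Meta Attrs:"
    flagged = []
    for line in config_text.splitlines():
        stripped = line.strip()
        if not stripped.startswith(prefix):
            continue
        # Tokens are whitespace-separated name=value pairs.
        for token in stripped[len(prefix):].split():
            name, _, value = token.partition("=")
            if "=" in value:
                flagged.append(token)
    return flagged


# Excerpt copied from the pcs config output in the question.
SAMPLE_CONFIG = """\
Master: ttsyncagresource-master
Meta Attrs: notify=true
Resource: ttsyncagresource (class=ocf provider=mssql type=ag)
Attributes: ag_name=ttsyncag
Meta Attrs: failure=timeout=60s notify=true
"""

if __name__ == "__main__":
    for token in suspicious_meta_attrs(SAMPLE_CONFIG):
        print("check this meta attribute:", token)
```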
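I rebuilt the cluster from scratch and it now works fine. I am not sure what went wrong, but this time I did a full system update before starting.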