Error while trying to step down a MongoDB replica set primary
Setup:
1 primary, 3 secondaries, and 1 arbiter (all 5 are running).
When I run stepDown() on the primary, I get the following error in the shell -
m101:PRIMARY> rs.stepDown()
2017-06-07T15:01:21.357 E QUERY [thread1] Error: error doing query: failed: network error while attempting to run command 'replSetStepDown' on host '127.0.0.1:27018' :
DB.prototype.runCommand@src/mongo/shell/db.js:132:1
DB.prototype.adminCommand@src/mongo/shell/db.js:150:16
rs.stepDown@src/mongo/shell/utils.js:1261:12
@(shell):1:1
2017-06-07T15:01:21.360 I NETWORK [thread1] trying reconnect to 127.0.0.1:27018 (127.0.0.1) failed
2017-06-07T15:01:21.361 I NETWORK [thread1] reconnect 127.0.0.1:27018 (127.0.0.1) ok
m101:SECONDARY>
Note that in the end the primary did step down to a secondary. But why do I get this "failed: network error" here?
Here is the relevant portion of the stepped-down primary's log file -
2017-06-07T15:01:22.170 I REPL [replication-5] Restarting oplog query due to error: InterruptedDueToReplStateChange: operation was interrupted. Last fetched optime (with hash): { ts: Timestamp 1496827872000|1, t: 2 }[-3490433912114125886]. Restarts remaining: 3
2017-06-07T15:01:22.170 I REPL [replication-5] Scheduled new oplog query Fetcher source: localhost:27018 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1496827872000|1 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 2 } query metadata: { $replData: 1, $oplogQueryData: 1, $ssm: { $secondaryOk: true } } active: 1 timeout: 65000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 159434 -- target:localhost:27018 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp 1496827872000|1 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 2 } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms
2017-06-07T15:01:22.173 W REPL [rsBackgroundSync] Fetcher stopped querying remote oplog with error: InvalidSyncSource: Sync source's last applied OpTime { ts: Timestamp 1496827872000|1, t: 2 } is not greater than our last fetched OpTime { ts: Timestamp 1496827872000|1, t: 2 }. Choosing new sync source.
2017-06-07T15:01:22.176 I REPL [rsBackgroundSync] could not find member to sync from
2017-06-07T15:01:22.180 I REPL [ReplicationExecutor] Member localhost:27018 is now in state SECONDARY
2017-06-07T15:01:27.335 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update to localhost:27018: InvalidSyncSource: Sync source was cleared. Was localhost:27018
2017-06-07T15:01:29.879 I - [conn2] end connection 127.0.0.1:59400 (5 connections now open)
2017-06-07T15:01:30.899 I NETWORK [thread1] connection accepted from 127.0.0.1:64421 #28 (5 connections now open)
2017-06-07T15:01:30.899 I NETWORK [conn28] received client metadata from 127.0.0.1:64421 conn28: { driver: { name: "NetworkInterfaceASIO-Replication", version: "3.4.4" }, os: { type: "Windows", name: "Microsoft Windows 8", architecture: "x86_64", version: "6.2 (build 9200)" } }
2017-06-07T15:01:30.917 I NETWORK [thread1] connection accepted from 127.0.0.1:64425 #29 (6 connections now open)
2017-06-07T15:01:30.920 I NETWORK [conn29] received client metadata from 127.0.0.1:64425 conn29: { driver: { name: "NetworkInterfaceASIO-Replication", version: "3.4.4" }, os: { type: "Windows", name: "Microsoft Windows 8", architecture: "x86_64", version: "6.2 (build 9200)" } }
2017-06-07T15:01:30.927 I NETWORK [thread1] connection accepted from 127.0.0.1:64428 #30 (7 connections now open)
2017-06-07T15:01:30.928 I NETWORK [conn30] received client metadata from 127.0.0.1:64428 conn30: { driver: { name: "NetworkInterfaceASIO-Replication", version: "3.4.4" }, os: { type: "Windows", name: "Microsoft Windows 8", architecture: "x86_64", version: "6.2 (build 9200)" } }
2017-06-07T15:01:31.561 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2017-06-07T15:01:31.562 I REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected
2017-06-07T15:01:31.563 I REPL [ReplicationExecutor] VoteRequester(term 2 dry run) received a no vote from localhost:27018 with reason "candidate's term is lower than mine"; response message: { term: 3, voteGranted: false, reason: "candidate's term is lower than mine", ok: 1.0 }
2017-06-07T15:01:31.563 I REPL [ReplicationExecutor] not running for primary, we have been superceded already
2017-06-07T15:01:32.183 I REPL [ReplicationExecutor] Member localhost:27019 is now in state PRIMARY
2017-06-07T15:01:32.185 I REPL [rsBackgroundSync] sync source candidate: localhost:27019
2017-06-07T15:01:32.186 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to localhost:27019
2017-06-07T15:01:33.198 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to localhost:27019, took 1012ms (1 connections now open to localhost:27019)
I am trying to understand what is happening here. Any pointers would be much appreciated.
Thanks in advance!
This is expected behaviour. When a primary steps down, it drops all of its connections, including the active connection from your mongo shell. The mongo shell then re-establishes the connection (per the "trying reconnect" messages).
The rs.stepDown() documentation could be clearer on this point with a sample of the shell output, but the relevant description is:
Upon successful stepdown, rs.stepDown() forces all clients currently connected to the database to disconnect. This helps ensure that the clients maintain an accurate view of the replica set.
Because the disconnect includes the connection used to run the command, you cannot retrieve the return status of the command if the command completes successfully; i.e. you can only retrieve the return status of the command if it errors. When running the command in a script, the script should account for this behavior.
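Since a script cannot rely on getting a reply from a successful `replSetStepDown`, the usual pattern is to treat the resulting network error as probable success and then verify the node's state after reconnecting. A minimal sketch of that pattern, with a stub standing in for the driver call (in real code you would use pymongo and catch `pymongo.errors.AutoReconnect`; the `run_step_down` stub here is hypothetical):

```python
# Sketch: handling the expected disconnect when stepping down a primary.
# Assumption: run_step_down is a stand-in for issuing the replSetStepDown
# admin command via a driver; a real driver would raise a network-level
# exception (e.g. pymongo.errors.AutoReconnect) on the dropped connection.

class NetworkError(Exception):
    """Stand-in for the driver's network/auto-reconnect exception."""


def run_step_down(simulate_disconnect=True):
    # On success, the primary closes all client connections before the
    # command's reply can be delivered, so the client sees a network error.
    if simulate_disconnect:
        raise NetworkError("connection closed by server during stepdown")
    return {"ok": 1}


def step_down_primary():
    try:
        run_step_down()
    except NetworkError:
        # Expected on a *successful* stepdown: the connection was dropped,
        # so we never received the reply. Reconnect and check member state
        # (e.g. rs.status() / isMaster) to confirm.
        return "stepdown likely succeeded (connection dropped)"
    # If we got a reply at all, the command errored or behaved unusually.
    return "stepdown returned without disconnect"


print(step_down_primary())
```

The key point is that the exception path, not the return path, is the success path here, which is why the docs warn that "you can only retrieve the return status of the command if it errors".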