Data loss due to unexpected failover of MongoDB replica set

I recently ran into the following problem:

I have a 5-member replica set (priorities in parentheses):

1 x Primary (2)
2 x Secondary (0.5)
1 x Hidden backup (0)
1 x Arbiter (0)
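
For illustration, here is a minimal sketch of what the members part of such a configuration might look like, written as a plain JavaScript object in the shape that rs.conf() returns. The set name and host names are hypothetical placeholders, not my real ones, and the small helper `electableHosts` is made up just to show which members can become primary at all.

```javascript
// Hypothetical config mirroring the layout described above.
// The set name and host values are placeholders (assumptions).
const exampleConfig = {
  _id: "rs0",
  members: [
    { _id: 0, host: "a.example.net:27017", priority: 2 },   // A, the primary
    { _id: 1, host: "b.example.net:27017", priority: 0.5 }, // B, the flaky secondary
    { _id: 2, host: "c.example.net:27017", priority: 0.5 }, // the other secondary
    { _id: 3, host: "backup.example.net:27017", priority: 0, hidden: true },
    { _id: 4, host: "arb.example.net:27017", priority: 0, arbiterOnly: true },
  ],
};

// Hypothetical helper: members with priority > 0 are the only ones
// that are eligible to become primary.
function electableHosts(cfg) {
  return cfg.members.filter((m) => m.priority > 0).map((m) => m.host);
}

console.log(electableHosts(exampleConfig));
```

So only A, B and the other 0.5 secondary are candidates for primary; the hidden backup and the arbiter can never take over.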

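One of the secondaries with priority 0.5 (let's call it B) ran into some network issues and had intermittent connectivity with the rest of the replica set. However, despite having staler data and a lower priority than the existing primary (let's call it A), it assumed the primary role: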

[ReplicationExecutor] VoteRequester: Got no vote from xxx because: candidate's data is staler than mine, resp:{ term: 29, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }

[ReplicationExecutor] election succeeded, assuming primary role in term 29

[ReplicationExecutor] transition to PRIMARY

And on A, even though it had no connectivity issues with the rest of the replica set:

[ReplicationExecutor] stepping down from primary, because a new term has begun: 29

So question 1 is: how is this possible under these circumstances?

Moving on, A (now a secondary) started rolling back data:

[rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 28, timestamp: xxx). source's GTE: (term: 29, timestamp: xxx) hashes: (xxx/xxx)

[rsBackgroundSync] beginning rollback

[rsBackgroundSync] rollback 0

[ReplicationExecutor] transition to ROLLBACK

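This caused data that had already been written to be removed. So question 2 is: how did OplogStart go missing?

Last but not least, question 3: how can this be avoided?

Thanks in advance!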

Are you using version 3.2.x with protocolVersion=1 (you can check with the rs.conf() command)? There is a "bug" in the voting. You can prevent it by doing one or both of the following:

Change the protocol version to 0: cfg = rs.conf(); cfg.protocolVersion=0; rs.reconfig(cfg);
Change all priorities to the same value
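
The two workarounds above could be sketched roughly as follows. This is plain JavaScript operating on an object in the shape returned by rs.conf(); the helper names (`withProtocolVersion0`, `withEqualPriorities`) and the sample config are made up for illustration. I'm assuming that "the same value" only applies to members that are electable to begin with, so priority-0 members (hidden, arbiter) are left untouched.

```javascript
// Workaround 1: switch the election protocol back to version 0.
// Returns a copy; the input config is not modified.
function withProtocolVersion0(cfg) {
  return Object.assign({}, cfg, { protocolVersion: 0 });
}

// Workaround 2: give every electable member the same priority.
// Assumption: members with priority 0 stay at 0 so they remain non-electable.
function withEqualPriorities(cfg, priority) {
  const members = cfg.members.map(function (m) {
    return m.priority === 0
      ? Object.assign({}, m)
      : Object.assign({}, m, { priority: priority });
  });
  return Object.assign({}, cfg, { members: members });
}

// Hypothetical sample config (host names are placeholders).
const sampleCfg = {
  _id: "rs0",
  protocolVersion: 1,
  members: [
    { _id: 0, host: "a.example.net:27017", priority: 2 },
    { _id: 1, host: "b.example.net:27017", priority: 0.5 },
    { _id: 2, host: "c.example.net:27017", priority: 0.5 },
    { _id: 3, host: "backup.example.net:27017", priority: 0, hidden: true },
    { _id: 4, host: "arb.example.net:27017", priority: 0, arbiterOnly: true },
  ],
};

// Apply both workarounds; either one can also be used on its own.
const fixedCfg = withEqualPriorities(withProtocolVersion0(sampleCfg), 1);
console.log(JSON.stringify(fixedCfg, null, 2));
```

In the mongo shell you would start from cfg = rs.conf() instead of the sample object and finish with rs.reconfig(cfg), as in the one-liner above.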

EDIT: These are the tickets that explain what is going on, more or less: Ticket 1, Ticket 2

Tags: replication, failover, mongodb, mongodb-replica-set