MongoDB replica set - Secondary repeatedly goes down

I'm trying to set up a MongoDB replica set on my Kubernetes cluster, but the secondary member keeps restarting after a few seconds.

Here is some information that might be useful to know:

Staging cluster state (4 nodes):

NAME                               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
gke-staging-pool-1-********-****   551m         28%    3423Mi          60%       
gke-staging-pool-1-********-****   613m         31%    3752Mi          66%       
gke-staging-pool-1-********-****   960m         49%    2781Mi          49%       
gke-staging-pool-1-********-****   602m         31%    3590Mi          63%       

For the moment, since the primary seems able to stay up, I managed to keep the cluster running by removing the votes of all members except the primary. That way Mongo doesn't give up the primary for not being able to see a majority of the set, and my application can still perform writes.
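The workaround above can be sketched from the mongo shell roughly like this (member indices assumed to match the `rs.conf()` output below; a forced reconfig may be needed if the set momentarily has no primary):

```javascript
// Strip the vote from every member except the primary (_id 0 in this set),
// so the primary alone constitutes a majority of voting members.
// Note: non-voting members must also have priority 0.
cfg = rs.conf()
cfg.members.forEach(function (m) {
  if (m._id !== 0) {
    m.votes = 0
    m.priority = 0
  }
})
rs.reconfig(cfg)
```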

If I raise the logLevel to 5 on the secondary, the only errors I get are:

2019-04-02T15:11:42.233+0000 D EXECUTOR [replication-0] Executing a task on behalf of pool replication
2019-04-02T15:11:42.233+0000 D EXECUTOR [replication-0] Not reaping because the earliest retirement date is 2019-04-02T15:12:11.980+0000
2019-04-02T15:11:42.233+0000 D NETWORK  [RS] Timer received error: CallbackCanceled: Callback was canceled
2019-04-02T15:11:42.235+0000 D NETWORK  [RS] Decompressing message with snappy
2019-04-02T15:11:42.235+0000 D ASIO     [RS] Request 114334 finished with response: { cursor: { nextBatch: [], id: 46974224885, ns: "local.oplog.rs" }, ok: 1.0, operationTime: Timestamp(1554217899, 1), $replData: { term: 11536, lastOpCommitted: { ts: Timestamp(1554217899, 1), t: 11536 }, lastOpVisible: { ts: Timestamp(1554217899, 1), t: 11536 }, configVersion: 666752, replicaSetId: ObjectId('5c8a607380091703c787b3ff'), primaryIndex: 0, syncSourceIndex: -1 }, $oplogQueryData: { lastOpCommitted: { ts: Timestamp(1554217899, 1), t: 11536 }, lastOpApplied: { ts: Timestamp(1554217899, 1), t: 11536 }, rbid: 1, primaryIndex: 0, syncSourceIndex: -1 }, $clusterTime: { clusterTime: Timestamp(1554217899, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } }
2019-04-02T15:11:42.235+0000 D EXECUTOR [RS] Received remote response: RemoteResponse --  cmd:{ cursor: { nextBatch: [], id: 46974224885, ns: "local.oplog.rs" }, ok: 1.0, operationTime: Timestamp(1554217899, 1), $replData: { term: 11536, lastOpCommitted: { ts: Timestamp(1554217899, 1), t: 11536 }, lastOpVisible: { ts: Timestamp(1554217899, 1), t: 11536 }, configVersion: 666752, replicaSetId: ObjectId('5c8a607380091703c787b3ff'), primaryIndex: 0, syncSourceIndex: -1 }, $oplogQueryData: { lastOpCommitted: { ts: Timestamp(1554217899, 1), t: 11536 }, lastOpApplied: { ts: Timestamp(1554217899, 1), t: 11536 }, rbid: 1, primaryIndex: 0, syncSourceIndex: -1 }, $clusterTime: { clusterTime: Timestamp(1554217899, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } }
2019-04-02T15:11:42.235+0000 D EXECUTOR [replication-5] Executing a task on behalf of pool replication
2019-04-02T15:11:42.235+0000 D REPL     [replication-5] oplog fetcher read 0 operations from remote oplog
2019-04-02T15:11:42.235+0000 D EXECUTOR [replication-5] Scheduling remote command request: RemoteCommand 114336 -- target:foodchain-backend-mongodb-replicaset-0.foodchain-backend-mongodb-replicaset.foodchain.svc.cluster.local:27017 db:local expDate:2019-04-02T15:11:47.285+0000 cmd:{ getMore: 46974224885, collection: "oplog.rs", batchSize: 13981010, maxTimeMS: 50, term: 11536, lastKnownCommittedOpTime: { ts: Timestamp(1554217899, 1), t: 11536 } }
2019-04-02T15:11:42.235+0000 D ASIO     [replication-5] startCommand: RemoteCommand 114336 -- target:foodchain-backend-mongodb-replicaset-0.foodchain-backend-mongodb-replicaset.foodchain.svc.cluster.local:27017 db:local expDate:2019-04-02T15:11:47.285+0000 cmd:{ getMore: 46974224885, collection: "oplog.rs", batchSize: 13981010, maxTimeMS: 50, term: 11536, lastKnownCommittedOpTime: { ts: Timestamp(1554217899, 1), t: 11536 } }
2019-04-02T15:11:42.235+0000 D EXECUTOR [replication-5] Not reaping because the earliest retirement date is 2019-04-02T15:12:11.980+0000
2019-04-02T15:11:42.235+0000 D NETWORK  [RS] Timer received error: CallbackCanceled: Callback was canceled
2019-04-02T15:11:42.235+0000 D NETWORK  [RS] Compressing message with snappy
2019-04-02T15:11:42.235+0000 D NETWORK  [RS] Timer received error: CallbackCanceled: Callback was canceled
2019-04-02T15:11:42.235+0000 D NETWORK  [RS] Timer received error: CallbackCanceled: Callback was canceled
2019-04-02T15:11:42.235+0000 D NETWORK  [RS] Timer received error: CallbackCanceled: Callback was canceled

Given the network errors, I verified that all the members could connect to each other, and they can (it's explicitly stated in the logs of all three members).

Additional info:

foodchain_rs:PRIMARY> rs.status()
{
    "set" : "foodchain_rs",
    "date" : ISODate("2019-04-02T15:35:02.640Z"),
    "myState" : 1,
    "term" : NumberLong(11536),
    "syncingTo" : "",
    "syncSourceHost" : "",
    "syncSourceId" : -1,
    "heartbeatIntervalMillis" : NumberLong(2000),
    "optimes" : {
        "lastCommittedOpTime" : {
            "ts" : Timestamp(1554219299, 1),
            "t" : NumberLong(11536)
        },
        "readConcernMajorityOpTime" : {
            "ts" : Timestamp(1554219299, 1),
            "t" : NumberLong(11536)
        },
        "appliedOpTime" : {
            "ts" : Timestamp(1554219299, 1),
            "t" : NumberLong(11536)
        },
        "durableOpTime" : {
            "ts" : Timestamp(1554219299, 1),
            "t" : NumberLong(11536)
        }
    },
    "members" : [
        {
            "_id" : 0,
            "name" : "foodchain-backend-mongodb-replicaset-0.foodchain-backend-mongodb-replicaset.foodchain.svc.cluster.local:27017",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 4376,
            "optime" : {
                "ts" : Timestamp(1554219299, 1),
                "t" : NumberLong(11536)
            },
            "optimeDate" : ISODate("2019-04-02T15:34:59Z"),
            "syncingTo" : "",
            "syncSourceHost" : "",
            "syncSourceId" : -1,
            "infoMessage" : "",
            "electionTime" : Timestamp(1554214927, 1),
            "electionDate" : ISODate("2019-04-02T14:22:07Z"),
            "configVersion" : 666752,
            "self" : true,
            "lastHeartbeatMessage" : ""
        },
        {
            "_id" : 1,
            "name" : "foodchain-backend-mongodb-replicaset-1.foodchain-backend-mongodb-replicaset.foodchain.svc.cluster.local:27017",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 10,
            "optime" : {
                "ts" : Timestamp(1554219299, 1),
                "t" : NumberLong(11536)
            },
            "optimeDurable" : {
                "ts" : Timestamp(1554219299, 1),
                "t" : NumberLong(11536)
            },
            "optimeDate" : ISODate("2019-04-02T15:34:59Z"),
            "optimeDurableDate" : ISODate("2019-04-02T15:34:59Z"),
            "lastHeartbeat" : ISODate("2019-04-02T15:35:01.747Z"),
            "lastHeartbeatRecv" : ISODate("2019-04-02T15:35:01.456Z"),
            "pingMs" : NumberLong(0),
            "lastHeartbeatMessage" : "",
            "syncingTo" : "foodchain-backend-mongodb-replicaset-0.foodchain-backend-mongodb-replicaset.foodchain.svc.cluster.local:27017",
            "syncSourceHost" : "foodchain-backend-mongodb-replicaset-0.foodchain-backend-mongodb-replicaset.foodchain.svc.cluster.local:27017",
            "syncSourceId" : 0,
            "infoMessage" : "",
            "configVersion" : 666752
        }
    ],
    "ok" : 1,
    "operationTime" : Timestamp(1554219299, 1),
    "$clusterTime" : {
        "clusterTime" : Timestamp(1554219299, 1),
        "signature" : {
            "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
            "keyId" : NumberLong(0)
        }
    }
}
foodchain_rs:PRIMARY> rs.printReplicationInfo()
configured oplog size:   1228.8701171875MB
log length start to end: 1646798secs (457.44hrs)
oplog first event time:  Thu Mar 14 2019 14:08:51 GMT+0000 (UTC)
oplog last event time:   Tue Apr 02 2019 15:35:29 GMT+0000 (UTC)
now:                     Tue Apr 02 2019 15:35:34 GMT+0000 (UTC)
foodchain_rs:PRIMARY> db.getReplicationInfo()
{
    "logSizeMB" : 1228.8701171875,
    "usedMB" : 4.7,
    "timeDiff" : 1646838,
    "timeDiffHours" : 457.46,
    "tFirst" : "Thu Mar 14 2019 14:08:51 GMT+0000 (UTC)",
    "tLast" : "Tue Apr 02 2019 15:36:09 GMT+0000 (UTC)",
    "now" : "Tue Apr 02 2019 15:36:11 GMT+0000 (UTC)"
}
foodchain_rs:PRIMARY> rs.conf()
{
    "_id" : "foodchain_rs",
    "version" : 666752,
    "protocolVersion" : NumberLong(1),
    "writeConcernMajorityJournalDefault" : true,
    "members" : [
        {
            "_id" : 0,
            "host" : "foodchain-backend-mongodb-replicaset-0.foodchain-backend-mongodb-replicaset.foodchain.svc.cluster.local:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {

            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 1,
            "host" : "foodchain-backend-mongodb-replicaset-1.foodchain-backend-mongodb-replicaset.foodchain.svc.cluster.local:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 0,
            "tags" : {

            },
            "slaveDelay" : NumberLong(0),
            "votes" : 0
        }
    ],
    "settings" : {
        "chainingAllowed" : true,
        "heartbeatIntervalMillis" : 2000,
        "heartbeatTimeoutSecs" : 10,
        "electionTimeoutMillis" : 100,
        "catchUpTimeoutMillis" : -1,
        "catchUpTakeoverDelayMillis" : 30000,
        "getLastErrorModes" : {

        },
        "getLastErrorDefaults" : {
            "w" : 1,
            "wtimeout" : 0
        },
        "replicaSetId" : ObjectId("5c8a607380091703c787b3ff")
    }
}

The problem was that electionTimeoutMillis was set too low.

Lower values result in faster failover, but increased sensitivity to primary node or network slowness or spottiness.

In my case it was set to 100 ms, and my secondary didn't have enough time to find the primary, so it was unable to sync and therefore unavailable.
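The fix is to raise the timeout back toward the MongoDB default of 10000 ms; a minimal sketch of the reconfiguration from the mongo shell (same settings object shown in the `rs.conf()` output above):

```javascript
// Restore the election timeout to the default 10 seconds, giving the
// secondary enough time to see the primary before calling an election.
cfg = rs.conf()
cfg.settings.electionTimeoutMillis = 10000
rs.reconfig(cfg)
```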

I think it's also worth noting that the process was not being killed: the mongod PID was always 1, and the uptime shown in top did not match the uptime reported by rs.status() in the mongo shell.
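One way to confirm that Kubernetes is not restarting the container (pod name assumed from the member names above):

```shell
# If mongod were being OOM-killed or crashing, restartCount would climb;
# here it stays constant even while the rs.status() uptime keeps resetting.
kubectl -n foodchain get pod foodchain-backend-mongodb-replicaset-1 \
  -o jsonpath='{.status.containerStatuses[0].restartCount}'
```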

What I was doing was monitoring the secondary's uptime via the mongo shell, like this:

watch -n 1.0 "kubectl -n foodchain exec -it foodchain-backend-mongodb-replicaset-0 -- mongo --eval='rs.status().members.map(m => m.uptime)'"

With that command I could see that the secondary's uptime was never longer than 10 seconds, so I assumed it was restarting itself, getting OOM-killed, or something similar; instead, I think it was trying to call an election but had no votes to do so, and stayed silent about the restart. In fact, what really confused me was the lack of information on this despite having set the logLevel to 5.