应用无法连接到 MongoDB Enterprise,"Authentication failed",Ops Manager 卡在 "AdjustUsers"

App fails to connect to MongoDB Enterprise with "Authentication failed" and Ops Manager stuck in "AdjustUsers"

新部署的应用程序的凭据被 MongoDB 和 "Authentication failed" 拒绝。 MongoDB Ops Manager 已经在 "AdjustUsers" 上停留了几个小时。


验证人:

cf service-connector 8080 opsmanager.service.consul:8080

http://localhost:8080 打开浏览器并使用在门户上获得的 mongodb 服务密钥登录:

"ops_manager_url": "http://opsmanager.service.consul:8080",
"ops_manager_user": "xxx",
"ops_manager_password": "xxx",

我最近使用 mongorestore 恢复了大约 20GB 的数据:

cf service-connector 27020 xxx-0.service.consul:33602
mongorestore --gzip --port 27020 --username xxx --password xxx --db rs_xxx /backup/mongodump/ > mongorestore.log 2>&1 &

其他已经存在的应用没有问题。只有我使用蓝绿部署部署的应用无法连接:

cf push $APP-new 
cf map-route $APP-new $DOMAIN --hostname $APP
cf unmap-route $APP $DOMAIN --hostname $APP || true
cf unmap-route $APP-new $DOMAIN --hostname $APP-new
cf delete -f   $APP
cf rename      $APP-new $APP

在我的应用程序中,我没有指定 WriteConcern,所以我假设它只是主要的 ("w: 1")

对于每个 cf bind-servicecf unbind-serviceServicebroker(Cloud Foundry 组件)都会创建新的随机生成的凭据(请参阅 cf env $APP)。通过我们的 MongoEnterprise 服务产品,Service Broker 连接到 Ops Manager API 并将新用户部署到 mongod 副本集。

当 Ops Manager 部署用户 (createUser) 时,他使用 WriteConcern "majority",这意味着至少有一个辅助节点需要确认写入。

COMMAND  [conn53768] command rs_$DBNAME.$cmd command: createUser { createUser: "$USERNAME", pwd: "xxx", digestPassword: false, roles: [ { role: "readWrite", db: "rs_$DBNAME" } ], writeConcern: { w: "majority" } } keyUpdates:0 writeConflicts:0 numYields:0 reslen:138 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { W: 2 } }, Collection: { acquireCount: { w: 2 } } } protocol:op_query 9811ms

由于 mongorestore 是在该问题发生前几个小时执行的,因此 createUser 命令可能由于未得到辅助节点的确认而超时忙于在前台和后台建立索引。索引构建(默认 mongorestore 选项)首先在主节点上完成,完成后将在辅助节点上进行。这解释了 createUser.

期间次级节点的滞后上升

索引构建可能需要很长时间,具体取决于索引和数据大小。我们听到很多客户抱怨索引构建时间太长。

这是索引构建的日志。在 Ops Manager 中,客户可以看到流式传输 mongodb.log (opsmanager.service.consul),所有副本集成员都有一点延迟。

...
2017-02-15T10:40:06.199+0000 I INDEX    [repl writer worker 10] build index done.  scanned 1108917 total records. 672 secs
2017-02-15T10:50:08.553+0000 I INDEX    [repl writer worker 4] build index done.  scanned 1108917 total records. 602 secs
2017-02-15T11:01:13.888+0000 I INDEX    [repl writer worker 7] build index done.  scanned 1108917 total records. 665 secs
...
2017-02-15T15:01:37.405+0000 I INDEX    [repl index builder 176] build index done.  scanned 1109531 total records. 659 secs
2017-02-15T15:01:37.406+0000 I INDEX    [repl index builder 170] build index done.  scanned 1109531 total records. 659 secs
2017-02-15T15:16:20.139+0000 I INDEX    [repl index builder 170] build index done.  scanned 1109699 total records. 882 secs

这些是来自 Automation Agent 的错误(Ops Manager 通过 HTTP 与受管 VM 上的 Autoamtion Agent 对话,然后 Automation Agent 使用本机协议与 mongod 对话)

[2017/02/15 13:33:36.297] [.error] [cm/mongoctl/authctl.go:updateUser26Style:697] [101] <rs_$DB_NAME> [13:33:36.297] Error updating 2.6-style user = AuthUser($USER@rs_$DB_NAME,roles=AuthRoles(1:{"Rolename":"readWrite","Db":"rs_$DB_NAME"}),interned=true,identity=2) in db = rs_$DB_NAME : <rs_$DB_NAME> [13:33:36.297] Timed out (timeout=-1ns) trying to runCommandWithTimeout(dbName=rs_$DB_NAME, cmd=[{"Name":"updateUser","Value":"$USER"},{"Name":"pwd","Value":"$PASSWORD"},{"Name":"digestPassword","Value":false},{"Name":"roles","Value":[{"db":"rs_$DB_NAME","role":"readWrite"}]}])
[2017/02/15 13:33:36.297] [.error] [cm/mongoctl/authctl.go:UpsertUser:516] [101] <rs_$DB_NAME> [13:33:36.297] Error upserting 2.6-style user = AuthUser($USER@rs_$DB_NAME,roles=AuthRoles(1:{"Rolename":"readWrite","Db":"rs_$DB_NAME"}),interned=true,identity=2) : <rs_$DB_NAME> [13:33:36.297] Error updating 2.6-style user = AuthUser($USER@rs_$DB_NAME,roles=AuthRoles(1:{"Rolename":"readWrite","Db":"rs_$DB_NAME"}),interned=true,identity=2) in db = rs_$DB_NAME : <rs_$DB_NAME> [13:33:36.297] Timed out (timeout=-1ns) trying to runCommandWithTimeout(dbName=rs_$DB_NAME, cmd=[{"Name":"updateUser","Value":"$USER"},{"Name":"pwd","Value":"$PASSWORD"},{"Name":"digestPassword","Value":false},{"Name":"roles","Value":[{"db":"rs_$DB_NAME","role":"readWrite"}]}])
[2017/02/15 13:33:36.297] [.error] [cm/action/adjustusers.go:adjustUsers:52] [101] <rs_$DB_NAME> [13:33:36.297] Error upserting user = AuthUser($USER@rs_$DB_NAME,roles=AuthRoles(1:{"Rolename":"readWrite","Db":"rs_$DB_NAME"}),interned=true,identity=2). Trying to proceed though. : <rs_$DB_NAME> [13:33:36.297] Error upserting 2.6-style user = AuthUser($USER@rs_$DB_NAME,roles=AuthRoles(1:{"Rolename":"readWrite","Db":"rs_$DB_NAME"}),interned=true,identity=2) : <rs_$DB_NAME> [13:33:36.297] Error updating 2.6-style user = AuthUser($USER@rs_$DB_NAME,roles=AuthRoles(1:{"Rolename":"readWrite","Db":"rs_$DB_NAME"}),interned=true,identity=2) in db = rs_$DB_NAME : <rs_$DB_NAME> [13:33:36.297] Timed out (timeout=-1ns) trying to runCommandWithTimeout(dbName=rs_$DB_NAME, cmd=[{"Name":"updateUser","Value":"$USER"},{"Name":"pwd","Value":"$PASSWORD"},{"Name":"digestPassword","Value":false},{"Name":"roles","Value":[{"db":"rs_$DB_NAME","role":"readWrite"}]}])
[2017/02/15 13:33:36.381] [.error] [cm/executor/executor.go:ExecutePlan:184] <rs_$DB_NAME> [13:33:36.381] Postcondition failed for step AdjustUsers because
[The value of 'currentState.UsersRight' = false, but it should be true]. Outcome=3

总结:WriteConcern primary 的一切工作正常,但 WriteConcern majority 超时,周二阻塞索引构建。

注意到数据库绑定不成功的一种方法是在取消映射和删除旧应用程序之前测试新部署的应用程序(应用程序级别的健康检查):

cf push $APP-new 
# only old app active

# test new app
curl --fail --silent --output /dev/null https://$APP-new.$DOMAIN/status

cf map-route $APP-new $DOMAIN --hostname $APP
# both apps active
cf apps
cf routes

cf unmap-route $APP $DOMAIN --hostname $APP || true
# only new app active

cf unmap-route $APP-new $DOMAIN --hostname $APP-new
cf delete -f   $APP
cf rename      $APP-new $APP

cf apps

此示例假定应用名称与主机名相同。