Consul server fails to rejoin the cluster, leaves after a few seconds

A short overview of my case:

Initially I had a simple running Consul cluster with one server (server1) and multiple clients. This is my server1.json config file:

{
    "server": true,
    "datacenter": "dc1",
    "data_dir": "/opt/consul",
    "bind_addr": "0.0.0.0",
    "client_addr":"0.0.0.0",
    "bootstrap_expect": 3,
    "ui": true,
    "retry_join": ["provider=aws tag_key=Function tag_value=consul-server"],
    "encrypt":"WT2T9..."
}

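Note: bootstrap_expect was originally 1; I changed it to 3 when I added the new servers.

For testing purposes I wanted to add two more servers to the cluster.

So I configured and added the servers one at a time. Everything worked as expected, and I could confirm that my cluster now had 3 servers, with the two new servers (server2 and server3) as followers.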

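This is my server2.json config file:

The config file for server3 looks the same.

Now, with the cluster running, I stopped the Consul service on server1, which was the leader, for testing purposes. As a result, a new leader was elected.

The consul members command shows: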

Node     Address           Status  Type    Build  Protocol  DC   Segment
server2  xx.xx.x.xx1:8301  alive   server  1.9.3  2         dc1  <all>
server1  xx.xx.x.xx2:8301  left    server  1.8.3  2         dc1  <all>
server3  xx.xx.x.xx3:8301  alive   server  1.9.3  2         dc1  <all>

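Looks good! server1 has left.

Now I wanted to bring server1 back into my cluster.

So I restarted consul.service on server1, expecting it to rejoin the cluster.

It did join, but after a few seconds it failed and left the cluster.

Here is some output:

Output of systemctl status consul.service: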

 consul.service - Consul server agent
   Loaded: loaded (/etc/systemd/system/consul.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Thu 2021-02-04 13:33:25 UTC; 3s ago
     Docs: https://www.consul.io/
  Process: 18632 ExecStart=/usr/local/bin/consul agent -ui -config-dir=/etc/consul.d/ (code=exited, status=2)
 Main PID: 18632 (code=exited, status=2)

Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/raft.(*Raft).runFSM.func2(0xc0007ec400, 0x40, 0x40)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /go/pkg/mod/github.com/hashicorp/raft@v1.1.2/fsm.go:113 +0x75
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/raft.(*Raft).runFSM(0xc000302900)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /go/pkg/mod/github.com/hashicorp/raft@v1.1.2/fsm.go:219 +0x42f
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc000302900, 0xc00085e340)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /go/pkg/mod/github.com/hashicorp/raft@v1.1.2/state.go:146 +0x55
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: created by github.com/hashicorp/raft.(*raftState).goFunc
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /go/pkg/mod/github.com/hashicorp/raft@v1.1.2/state.go:144 +0x66
Feb 04 13:33:25 ip-xx-xx-x-xxx systemd[1]: consul.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 04 13:33:25 ip-xx-xx-x-xxx systemd[1]: consul.service: Failed with result 'exit-code'.

The output of journalctl -xe -u consul shows the following:

Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.267Z [INFO]  agent: discover-aws: Instance i-0ae416a280967e345 has private ip 10.10.1.555: cluster=LAN
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.268Z [INFO]  agent: discover-aws: Instance i-0e5394a632853e200 has private ip 10.10.1.777: cluster=LAN
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.268Z [INFO]  agent: discover-aws: Instance i-019966401267318b4 has private ip 10.10.1.222: cluster=LAN
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.268Z [INFO]  agent: Discovered servers: cluster=LAN cluster=LAN servers="10.10.1.555 10.10.1.777 10.10.1.222"
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.268Z [INFO]  agent: (LAN) joining: lan_addresses=[10.10.1.555, 10.10.1.777, 10.10.1.222]
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.271Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: ip-10-10-1-777 10.10.1.777
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.271Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: ip-10-10-1-222 10.10.1.222
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.272Z [INFO]  agent.server: Adding LAN server: server="ip-10-10-1-777 (Addr: tcp/10.10.1.777:8300) (DC: dc1)"
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.272Z [INFO]  agent.server: Adding LAN server: server="ip-10-10-1-222 (Addr: tcp/10.10.1.41:8300) (DC: dc1)"
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.274Z [INFO]  agent: (LAN) joined: number_of_nodes=3
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.275Z [INFO]  agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=3
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.278Z [WARN]  agent.server.raft: failed to get previous log: previous-index=93400 last-index=93262 error="log not found"
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.505Z [INFO]  agent: Synced node info
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.526Z [INFO]  agent: Deregistered check: check=vault:10.10.1.555:8200:vault-sealed-check
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: panic: failed to decode request: invalid config entry kind: service-intentions
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: goroutine 37 [running]:
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/consul/agent/consul/fsm.(*FSM).applyConfigEntryOperation(0xc000923c80, 0xc0008b0421, 0x150, 0x150, 0x169ef, 0x0, 0x0)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /home/circleci/project/consul/agent/consul/fsm/commands_oss.go:453 +0xb27
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/consul/agent/consul/fsm.New.func1(0xc0008b0421, 0x150, 0x150, 0x169ef, 0xc000538a50, 0xc000eb9380)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /home/circleci/project/consul/agent/consul/fsm/fsm.go:85 +0x56
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/consul/agent/consul/fsm.(*FSM).Apply(0xc000923c80, 0xc0013702d0, 0x0, 0x0)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /home/circleci/project/consul/agent/consul/fsm/fsm.go:120 +0x24b
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/go-raftchunking.(*ChunkingFSM).Apply(0xc000984780, 0xc0013702d0, 0x53a3f80, 0xbfff1bc9745fc84c)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /go/pkg/mod/github.com/hashicorp/go-raftchunking@v0.6.1/fsm.go:66 +0x5b
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/raft.(*Raft).runFSM.func1(0xc000956fe0)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /go/pkg/mod/github.com/hashicorp/raft@v1.1.2/fsm.go:90 +0x2c1
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/raft.(*Raft).runFSM.func2(0xc0007ec400, 0x40, 0x40)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /go/pkg/mod/github.com/hashicorp/raft@v1.1.2/fsm.go:113 +0x75
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/raft.(*Raft).runFSM(0xc000302900)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /go/pkg/mod/github.com/hashicorp/raft@v1.1.2/fsm.go:219 +0x42f
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc000302900, 0xc00085e340)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /go/pkg/mod/github.com/hashicorp/raft@v1.1.2/state.go:146 +0x55
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: created by github.com/hashicorp/raft.(*raftState).goFunc
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /go/pkg/mod/github.com/hashicorp/raft@v1.1.2/state.go:144 +0x66
Feb 04 13:33:25 ip-xx-xx-x-xxx systemd[1]: consul.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 04 13:33:25 ip-xx-xx-x-xxx systemd[1]: consul.service: Failed with result 'exit-code'.

As you can see from the second output, server1 does actually join the cluster, but then something goes wrong and it leaves again:

panic: failed to decode request: invalid config entry kind: service-intentions

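I tried modifying the config file in different ways and restarting Consul on server1 --> no success.

Finally I emptied the Consul data_dir and restarted server1 --> still no success.

Has anyone experienced something like this and can help here?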

If you look a few lines above the "panic" line you provided, you'll see:

[WARN]  agent.server.raft: failed to get previous log: previous-index=93400 last-index=93262 error="log not found"

There seems to be a problem with log replication. The current leader cannot replicate its log to the new follower.
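To put a number on that warning, here is a minimal Python sketch that extracts the two indices from the log line and computes how far apart they are. The function name raft_log_gap is made up for illustration, and reading previous-index as the entry the leader refers to and last-index as the last entry server1 has locally is my interpretation of the message, not something your logs state explicitly.

```python
import re

# The WARN line copied from the journalctl output in the question.
WARN_LINE = (
    "2021-02-04T13:33:25.278Z [WARN]  agent.server.raft: failed to get previous log: "
    'previous-index=93400 last-index=93262 error="log not found"'
)


def raft_log_gap(line):
    """Return (previous_index, last_index, gap) for a 'failed to get previous log'
    line, or None if the line does not contain both indices."""
    match = re.search(r"previous-index=(\d+)\s+last-index=(\d+)", line)
    if match is None:
        return None
    previous_index = int(match.group(1))
    last_index = int(match.group(2))
    return previous_index, last_index, previous_index - last_index


print(raft_log_gap(WARN_LINE))  # (93400, 93262, 138)
```

Under that reading, server1 is 138 entries behind. A gap like this is expected after a node has been offline for a while, so it is worth looking at together with the panic rather than on its own.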

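While I guess there could be several reasons for this, one likely culprit is the Consul version. As far as I can see from your consul members output, server1 runs an older version (1.8.3) than your leader (1.9.3, whether that is server2 or server3). That would also fit the panic itself: as far as I know, service-intentions is a config entry kind that was introduced in Consul 1.9, so a 1.8.3 server would not know how to decode it when it shows up in the replicated log.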

Node     Address           Status  Type    Build  Protocol  DC   Segment
server2  xx.xx.x.xx1:8301  alive   server  1.9.3  2         dc1  <all>
server1  xx.xx.x.xx2:8301  left    server  1.8.3  2         dc1  <all>
server3  xx.xx.x.xx3:8301  alive   server  1.9.3  2         dc1  <all>

Try removing the Consul binary from the problematic node (server1 in your case) and replacing it with the newer version. Then restart Consul.
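If you want to catch this kind of mismatch before restarting a node, a small script can compare the Build column across servers. The following is only a sketch in Python: the function names parse_build and outdated_servers are hypothetical, and it assumes the whitespace-separated column layout shown in your consul members output (Node, Address, Status, Type, Build, ...).

```python
import re

# Sample input: the consul members output from the question.
MEMBERS_OUTPUT = """\
Node     Address           Status  Type    Build  Protocol  DC   Segment
server2  xx.xx.x.xx1:8301  alive   server  1.9.3  2         dc1  <all>
server1  xx.xx.x.xx2:8301  left    server  1.8.3  2         dc1  <all>
server3  xx.xx.x.xx3:8301  alive   server  1.9.3  2         dc1  <all>
"""


def parse_build(build):
    """Turn a build string such as '1.9.3' into a comparable tuple (1, 9, 3)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", build)
    if match is None:
        raise ValueError(f"unrecognized build string: {build!r}")
    return tuple(int(part) for part in match.groups())


def outdated_servers(members_output):
    """Return the names of servers whose build is older than the newest server build."""
    # Skip the header row and split each remaining row on whitespace.
    rows = [line.split() for line in members_output.strip().splitlines()[1:]]
    # Column 3 is Type, column 4 is Build (assumed layout, see above).
    servers = {row[0]: parse_build(row[4]) for row in rows if row[3] == "server"}
    if not servers:
        return []
    newest = max(servers.values())
    return sorted(name for name, build in servers.items() if build < newest)


print(outdated_servers(MEMBERS_OUTPUT))  # ['server1']
```

In practice you would feed it the live output of consul members instead of the pasted sample. After the upgrade, the list should come back empty.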