如何将现有的 TiKV 节点连接到 TiDB 中的新 PD 集群?

How do I connect existing TiKV nodes to a new cluster of PDs in TiDB?

我在 gcloud 中有一个可用的 TiDB 实例 运行,使用 tidb-ansible 脚本部署。我想用新的替换 PD 节点,所以我销毁并替换了它们。现在 PD 集群正常运行了,但是当我尝试启动 TiKV 节点时,我得到了这个错误:

2018/02/28 01:42:08.091 node.rs:191: [ERROR] cluster ID mismatch: local_id 6520261967047847245 remote_id 6527407705559138241. you are trying to connect to another cluster, please reconnect to the correct PD

TiDB FAQs (https://pingcap.com/docs/FAQ/) 中对错误有很好的解释:

-- The cluster ID mismatch message is displayed when starting TiKV. --

This is because the cluster ID stored in local TiKV is different from the cluster ID specified by PD. When a new PD cluster is deployed, PD generates random cluster IDs. TiKV gets the cluster ID from PD and stores the cluster ID locally when it is initialized. The next time when TiKV is started, it checks the local cluster ID with the cluster ID in PD. If the cluster IDs don’t match, the cluster ID mismatch message is displayed and TiKV exits.

If you previously deploy a PD cluster, but then you remove the PD data and deploy a new PD cluster, this error occurs because TiKV uses the old data to connect to the new PD cluster.

但没有说明如何解决问题。有没有办法销毁 TiKV 实例上的本地集群 ID,使其能正常挂接到 PD?

如果我能让他们再次通话,PD 是否能够协调我现有的 TiKV 节点(与现有数据)?

Is there a way to destroy the local cluster ID on the TiKV instance so it can hook up with the PD properly?

为了让 TiKV 实例和 PD 正确挂钩,可以修改 PD 的集群 ID。请参阅以下步骤和示例。

Will PD be able to coordinate my existing TiKV nodes (with existing data) if I can get them to talk again?

是的,会的。

您可以使用“pd-recover”来解决这个问题。

  • 步骤 1. 运行 新 pd-server 集群 ID:6527407705559138241.

  • 步骤2.将集群ID更改为6520261967047847245

    ./pd-recover --endpoints "http://the-new-pd-server:port" --cluster-id 6520261967047847245 --alloc-id 100000000
    
  • 第三步,重启PD服务器

请注意,PD 有一个 MONOTONIC 唯一 ID 分配器,即 alloc-id。所有区域 ID 和对等 ID 均由分配器生成。所以请确保你在步骤 2 中选择的 ID 足够大,任何现有的 ID 都不能超过,否则会 corrupt TiKV。


示例

neil:bin/ (master) $ ./pd-server &
[1] 32718
2018/03/01 10:51:01.343 util.go:59: [info] Welcome to Placement Driver (PD).                                                                                                                                                                                                               
2018/03/01 10:51:01.343 util.go:60: [info] Release Version: 0.9.0
2018/03/01 10:51:01.343 util.go:61: [info] Git Commit Hash: 651d0dd52a46b7990d0cd74d33f2f10194d46565
2018/03/01 10:51:01.343 util.go:62: [info] Git Branch: namespace
2018/03/01 10:51:01.343 util.go:63: [info] UTC Build Time:  2017-09-13 05:30:13
2018/03/01 10:51:01.343 metricutil.go:83: [info] disable Prometheus push client
2018/03/01 10:51:01.344 server.go:87: [info] PD config - Config({FlagSet:0xc420177500 Version:false ClientUrls:http://127.0.0.1:2379 PeerUrls:http://127.0.0.1:2380 AdvertiseClientUrls:http://127.0.0.1:2379 AdvertisePeerUrls:http://127.0.0.1:2380 Name:pd DataDir:default.pd InitialCluster:pd=http://127.0.0.1:2380 InitialClusterState:new Join: LeaderLease:3 Log:{Level: Format:text DisableTimestamp:false File:{Filename: LogRotate:true MaxSize:0 MaxDays:0 MaxBackups:0}} LogFileDeprecated: LogLevelDeprecated: TsoSaveInterval:3s Metric:{PushJob:pd PushAddress: PushInterval:0s} Schedule:{MaxSnapshotCount:3 MaxStoreDownTime:1h0m0s LeaderScheduleLimit:64 RegionScheduleLimit:12 ReplicaScheduleLimit:16} Replication:{MaxReplicas:3 LocationLabels:[]} QuotaBackendBytes:0 AutoCompactionRetention:1 TickInterval:500ms ElectionInterval:3s configFile: WarningMsgs:[] nextRetryDelay:1000000000 disableStrictReconfigCheck:false})
2018/03/01 10:51:01.346 server.go:114: [info] start embed etcd
2018/03/01 10:51:01.347 log.go:84: [info] embed: [listening for peers on  http://127.0.0.1:2380]
2018/03/01 10:51:01.347 log.go:84: [info] embed: [pprof is enabled under /debug/pprof]
2018/03/01 10:51:01.347 log.go:84: [info] embed: [listening for client requests on  127.0.0.1:2379]
2018/03/01 10:51:01 systime_mon.go:11: [info] start system time monitor 
2018/03/01 10:51:01.408 log.go:84: [info] etcdserver: [name = pd]
2018/03/01 10:51:01.409 log.go:84: [info] etcdserver: [data dir = default.pd]
2018/03/01 10:51:01.409 log.go:84: [info] etcdserver: [member dir = default.pd/member]
2018/03/01 10:51:01.409 log.go:84: [info] etcdserver: [heartbeat = 500ms]
2018/03/01 10:51:01.409 log.go:84: [info] etcdserver: [election = 3000ms]
2018/03/01 10:51:01.409 log.go:84: [info] etcdserver: [snapshot count = 100000]
2018/03/01 10:51:01.409 log.go:84: [info] etcdserver: [advertise client URLs = http://127.0.0.1:2379]
2018/03/01 10:51:01.409 log.go:84: [info] etcdserver: [initial advertise peer URLs = http://127.0.0.1:2380]
2018/03/01 10:51:01.409 log.go:84: [info] etcdserver: [initial cluster = pd=http://127.0.0.1:2380]
2018/03/01 10:51:01.475 log.go:84: [info] etcdserver: [starting member b71f75320dc06a6c in cluster 1c45a069f3a1d796]
2018/03/01 10:51:01.475 log.go:84: [info] raft: [b71f75320dc06a6c became follower at term 0]
2018/03/01 10:51:01.475 log.go:84: [info] raft: [newRaft b71f75320dc06a6c [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]]
2018/03/01 10:51:01.475 log.go:84: [info] raft: [b71f75320dc06a6c became follower at term 1]
2018/03/01 10:51:01.587 log.go:80: [warning] auth: [simple token is not cryptographically signed]
2018/03/01 10:51:01.631 log.go:84: [info] etcdserver: [starting server... [version: 3.2.4, cluster version: to_be_decided]]
2018/03/01 10:51:01.632 log.go:84: [info] etcdserver/membership: [added member b71f75320dc06a6c [http://127.0.0.1:2380] to cluster 1c45a069f3a1d796]
2018/03/01 10:51:01.633 server.go:129: [info] create etcd v3 client with endpoints [http://127.0.0.1:2379]
2018/03/01 10:51:03.476 log.go:84: [info] raft: [b71f75320dc06a6c is starting a new election at term 1]
2018/03/01 10:51:03.476 log.go:84: [info] raft: [b71f75320dc06a6c became candidate at term 2]
2018/03/01 10:51:03.476 log.go:84: [info] raft: [b71f75320dc06a6c received MsgVoteResp from b71f75320dc06a6c at term 2]
2018/03/01 10:51:03.476 log.go:84: [info] raft: [b71f75320dc06a6c became leader at term 2]
2018/03/01 10:51:03.476 log.go:84: [info] raft: [raft.node: b71f75320dc06a6c elected leader b71f75320dc06a6c at term 2]
2018/03/01 10:51:03.477 log.go:84: [info] etcdserver: [setting up the initial cluster version to 3.2]
2018/03/01 10:51:03.477 log.go:84: [info] etcdserver: [published {Name:pd ClientURLs:[http://127.0.0.1:2379]} to cluster 1c45a069f3a1d796]
2018/03/01 10:51:03.477 log.go:84: [info] embed: [ready to serve client requests]
2018/03/01 10:51:03.478 log.go:82: [info] embed: [serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!]
2018/03/01 10:51:03.480 etcdutil.go:125: [warning] check etcd http://127.0.0.1:2379 status, resp: &{cluster_id:2037210783374497686 member_id:13195394291058371180 revision:1 raft_term:2  3.2.4 24576 13195394291058371180 3 2}, err: <nil>, cost: 1.84566554s
2018/03/01 10:51:03.489 log.go:82: [info] etcdserver/membership: [set the initial cluster version to 3.2]
2018/03/01 10:51:03.489 log.go:84: [info] etcdserver/api: [enabled capabilities for version 3.2]
2018/03/01 10:51:03.500 server.go:174: [info] init cluster id 6527803384525484955
2018/03/01 10:51:03.579 tso.go:104: [info] sync and save timestamp: last 0001-01-01 00:00:00 +0000 UTC save 2018-03-01 10:51:06.578778001 +0800 CST
2018/03/01 10:51:03.579 leader.go:249: [info] PD cluster leader pd is ready to serve

neil:bin/ (master) $ ./pd-recover --endpoints "http://localhost:2379" --alloc-id 100000000 --cluster-id 66666666666
recover success! please restart the PD cluster
neil:bin/ (master) $ kill 32718
2018/03/01 10:51:35.258 server.go:228: [info] closing server
2018/03/01 10:51:35.258 leader.go:107: [error] campaign leader err github.com/pingcap/pd/server/leader.go:269: server closed
2018/03/01 10:51:35.258 leader.go:65: [info] server is closed, return leader loop
2018/03/01 10:51:35.259 log.go:84: [info] etcdserver: [skipped leadership transfer for single member cluster]
2018/03/01 10:51:35.259 log.go:84: [info] etcdserver/api/v3rpc: [grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {127.0.0.1:2379 <nil>}]
2018/03/01 10:51:35.259 log.go:84: [info] etcdserver/api/v3rpc: [Failed to dial 127.0.0.1:2379: grpc: the connection is closing; please retry.]
2018/03/01 10:51:35.291 server.go:246: [info] close server
2018/03/01 10:51:35.291 main.go:89: [info] Got signal [15] to exit.
[1]  + 32718 done       ./pd-server
neil:bin/ (master) $ ./pd-server
2018/03/01 10:51:40.007 util.go:59: [info] Welcome to Placement Driver (PD).
2018/03/01 10:51:40.007 util.go:60: [info] Release Version: 0.9.0
2018/03/01 10:51:40.007 util.go:61: [info] Git Commit Hash: 651d0dd52a46b7990d0cd74d33f2f10194d46565
2018/03/01 10:51:40.007 util.go:62: [info] Git Branch: namespace
2018/03/01 10:51:40.007 util.go:63: [info] UTC Build Time:  2017-09-13 05:30:13
2018/03/01 10:51:40.007 metricutil.go:83: [info] disable Prometheus push client
2018/03/01 10:51:40.007 server.go:87: [info] PD config - Config({FlagSet:0xc4200771a0 Version:false ClientUrls:http://127.0.0.1:2379 PeerUrls:http://127.0.0.1:2380 AdvertiseClientUrls:http://127.0.0.1:2379 AdvertisePeerUrls:http://127.0.0.1:2380 Name:pd DataDir:default.pd InitialCluster:pd=http://127.0.0.1:2380 InitialClusterState:new Join: LeaderLease:3 Log:{Level: Format:text DisableTimestamp:false File:{Filename: LogRotate:true MaxSize:0 MaxDays:0 MaxBackups:0}} LogFileDeprecated: LogLevelDeprecated: TsoSaveInterval:3s Metric:{PushJob:pd PushAddress: PushInterval:0s} Schedule:{MaxSnapshotCount:3 MaxStoreDownTime:1h0m0s LeaderScheduleLimit:64 RegionScheduleLimit:12 ReplicaScheduleLimit:16} Replication:{MaxReplicas:3 LocationLabels:[]} QuotaBackendBytes:0 AutoCompactionRetention:1 TickInterval:500ms ElectionInterval:3s configFile: WarningMsgs:[] nextRetryDelay:1000000000 disableStrictReconfigCheck:false})
2018/03/01 10:51:40.010 server.go:114: [info] start embed etcd
2018/03/01 10:51:40 systime_mon.go:11: [info] start system time monitor 
2018/03/01 10:51:40.011 log.go:84: [info] embed: [listening for peers on  http://127.0.0.1:2380]
2018/03/01 10:51:40.011 log.go:84: [info] embed: [pprof is enabled under /debug/pprof]
2018/03/01 10:51:40.011 log.go:84: [info] embed: [listening for client requests on  127.0.0.1:2379]
2018/03/01 10:51:40.019 log.go:84: [info] etcdserver: [name = pd]
2018/03/01 10:51:40.020 log.go:84: [info] etcdserver: [data dir = default.pd]
2018/03/01 10:51:40.020 log.go:84: [info] etcdserver: [member dir = default.pd/member]
2018/03/01 10:51:40.020 log.go:84: [info] etcdserver: [heartbeat = 500ms]
2018/03/01 10:51:40.020 log.go:84: [info] etcdserver: [election = 3000ms]
2018/03/01 10:51:40.020 log.go:84: [info] etcdserver: [snapshot count = 100000]
2018/03/01 10:51:40.020 log.go:84: [info] etcdserver: [advertise client URLs = http://127.0.0.1:2379]
2018/03/01 10:51:40.020 log.go:84: [info] etcdserver: [restarting member b71f75320dc06a6c in cluster 1c45a069f3a1d796 at commit index 20]
2018/03/01 10:51:40.020 log.go:84: [info] raft: [b71f75320dc06a6c became follower at term 2]
2018/03/01 10:51:40.020 log.go:84: [info] raft: [newRaft b71f75320dc06a6c [peers: [], term: 2, commit: 20, applied: 0, lastindex: 20, lastterm: 2]]
2018/03/01 10:51:40.072 log.go:80: [warning] auth: [simple token is not cryptographically signed]
2018/03/01 10:51:40.113 log.go:84: [info] etcdserver: [starting server... [version: 3.2.4, cluster version: to_be_decided]]
2018/03/01 10:51:40.115 log.go:84: [info] etcdserver/membership: [added member b71f75320dc06a6c [http://127.0.0.1:2380] to cluster 1c45a069f3a1d796]
2018/03/01 10:51:40.116 etcdutil.go:62: [error] failed to get raft cluster member(s) from the given urls.
2018/03/01 10:51:40.116 server.go:129: [info] create etcd v3 client with endpoints [http://127.0.0.1:2379]
2018/03/01 10:51:40.116 log.go:82: [info] etcdserver/membership: [set the initial cluster version to 3.2]
2018/03/01 10:51:40.116 log.go:84: [info] etcdserver/api: [enabled capabilities for version 3.2]
2018/03/01 10:51:41.021 log.go:84: [info] raft: [b71f75320dc06a6c is starting a new election at term 2]
2018/03/01 10:51:41.021 log.go:84: [info] raft: [b71f75320dc06a6c became candidate at term 3]
2018/03/01 10:51:41.021 log.go:84: [info] raft: [b71f75320dc06a6c received MsgVoteResp from b71f75320dc06a6c at term 3]
2018/03/01 10:51:41.021 log.go:84: [info] raft: [b71f75320dc06a6c became leader at term 3]
2018/03/01 10:51:41.021 log.go:84: [info] raft: [raft.node: b71f75320dc06a6c elected leader b71f75320dc06a6c at term 3]
2018/03/01 10:51:41.039 log.go:84: [info] etcdserver: [published {Name:pd ClientURLs:[http://127.0.0.1:2379]} to cluster 1c45a069f3a1d796]
2018/03/01 10:51:41.039 log.go:84: [info] embed: [ready to serve client requests]
2018/03/01 10:51:41.040 log.go:82: [info] embed: [serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!]
2018/03/01 10:51:41.066 server.go:174: [info] init cluster id 66666666666
2018/03/01 10:51:41.250 cache.go:379: [info] load 0 stores cost 465.361µs
2018/03/01 10:51:41.251 cache.go:385: [info] load 0 regions cost 426.452µs
2018/03/01 10:51:41.251 coordinator.go:123: [info] coordinator: Start collect cluster information
2018/03/01 10:51:41.251 coordinator.go:126: [info] coordinator: Cluster information is prepared
2018/03/01 10:51:41.251 coordinator.go:136: [info] coordinator: Run scheduler
2018/03/01 10:51:41.252 tso.go:104: [info] sync and save timestamp: last 0001-01-01 00:00:00 +0000 UTC save 2018-03-01 10:51:44.251760951 +0800 CST
2018/03/01 10:51:41.252 leader.go:249: [info] PD cluster leader pd is ready to serve
^C2018/03/01 10:51:56.077 server.go:228: [info] closing server
2018/03/01 10:51:56.077 coordinator.go:277: [info] balance-hot-region-scheduler stopped: context canceled
2018/03/01 10:51:56.077 coordinator.go:277: [info] balance-region-scheduler stopped: context canceled
2018/03/01 10:51:56.077 coordinator.go:277: [info] balance-leader-scheduler stopped: context canceled
2018/03/01 10:51:56.077 leader.go:107: [error] campaign leader err github.com/pingcap/pd/server/leader.go:269: server closed
2018/03/01 10:51:56.078 leader.go:65: [info] server is closed, return leader loop
2018/03/01 10:51:56.078 log.go:84: [info] etcdserver: [skipped leadership transfer for single member cluster]
2018/03/01 10:51:56.078 log.go:84: [info] etcdserver/api/v3rpc: [grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {127.0.0.1:2379 <nil>}]
2018/03/01 10:51:56.078 log.go:84: [info] etcdserver/api/v3rpc: [Failed to dial 127.0.0.1:2379: grpc: the connection is closing; please retry.]
2018/03/01 10:51:56.118 server.go:246: [info] close server
2018/03/01 10:51:56.118 main.go:89: [info] Got signal [2] to exit.