How do replicas coming back online in PAXOS or RAFT catch up?
In consensus algorithms such as Paxos and Raft, a value is proposed and, if a quorum agrees, it is durably written to the data store. What happens to participants that are unavailable at the time the quorum is reached? How do they eventually catch up? This seems to be left as an exercise for the reader.
Look at the Raft protocol. Catch-up is simply built into the algorithm. The leader tracks, for each follower, the highest index known to be replicated (matchIndex) and the index of the next entry to send (nextIndex), and always sends each follower entries starting from that follower's nextIndex. There is then no special case needed to handle a follower that was absent and must catch up on committed entries. By the very nature of the algorithm, when a follower restarts, the leader ends up sending it entries starting from just after the last entry in that follower's log. That is how the node catches up.
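The catch-up loop described above can be sketched roughly as follows. This is a toy model; the class and method names (Leader, replicate, append_entries) are illustrative inventions, not any real library's API:

```python
class Leader:
    """Toy Raft-style leader: tracks nextIndex/matchIndex per follower."""

    def __init__(self, log):
        self.log = log         # list of (term, command) entries
        self.next_index = {}   # follower id -> index of next entry to send
        self.match_index = {}  # follower id -> highest index known replicated

    def register_follower(self, follower_id):
        # Optimistically assume the follower is fully up to date.
        self.next_index[follower_id] = len(self.log)
        self.match_index[follower_id] = -1

    def replicate(self, follower_id, follower):
        """Send entries from nextIndex, backing up until the logs match."""
        while True:
            start = self.next_index[follower_id]
            prev_term = self.log[start - 1][0] if start > 0 else 0
            if follower.append_entries(start - 1, prev_term, self.log[start:]):
                # Consistency check passed: follower now mirrors our log.
                self.match_index[follower_id] = len(self.log) - 1
                self.next_index[follower_id] = len(self.log)
                return
            # Rejected: decrement nextIndex and retry one entry earlier.
            self.next_index[follower_id] -= 1
```

A follower that restarts after missing entries simply rejects AppendEntries calls until nextIndex backs up to the last entry it holds, at which point a single successful call delivers everything it missed.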
With the original Paxos papers, it really is left as an exercise for the reader. In practice, with Paxos you can send additional messages, such as negative acknowledgements, as a performance optimisation that propagates more information around the cluster. These can be used to let a node know that it has fallen behind due to lost messages. Once a node knows it is behind, it needs to catch up, which can be done with additional message types. This is described as Retransmission in the Trex multi-paxos engine that I wrote to demystify Paxos.
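As a rough illustration of the negative-acknowledgement idea (all names below are hypothetical sketches, not the actual Trex API): an acceptor that sees a commit for a log slot beyond what it holds replies with a NACK naming the first slot it is missing, and the peer retransmits the chosen values from that slot onwards:

```python
class Acceptor:
    """Toy acceptor that learns chosen values slot by slot."""

    def __init__(self):
        self.chosen = {}  # slot number (1-based) -> chosen value

    def highest_contiguous_slot(self):
        slot = 0
        while slot + 1 in self.chosen:
            slot += 1
        return slot

    def on_commit(self, slot, value):
        """Accept a chosen value, or NACK with the first missing slot."""
        if slot > self.highest_contiguous_slot() + 1:
            return ("NACK", self.highest_contiguous_slot() + 1)
        self.chosen[slot] = value
        return None


def retransmit(leader_log, acceptor, nack):
    # On a NACK, resend every chosen value from the first missing slot.
    _, first_missing = nack
    for slot in range(first_missing, len(leader_log) + 1):
        acceptor.on_commit(slot, leader_log[slot - 1])
```

The point is only that a gap detected on one message is enough to trigger a targeted resend, so a node that missed traffic converges without any special recovery mode.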
Google's Chubby Paxos paper, Paxos Made Live, criticises Paxos for leaving a lot up to the people doing the implementation. Lamport trained as a mathematician and was attempting to mathematically prove that you couldn't have consensus over lossy networks when he found the solution. The original papers very much supply a proof that it is possible, rather than explain how to build practical systems with it. Modern papers usually describe an application of some new technique backed up by experimental results; they also supply a formal proof, but IMHO most people skip over it and take it on trust. The unapproachable way Paxos was introduced means that many people quote the original paper but fail to see what it actually describes. Unfortunately, Paxos is still taught in that theoretical style rather than the modern one, which makes people think it is hard and causes them to miss its essence.
I think Paxos is simple, but reasoning about failures in distributed systems, and testing to uncover any bugs, is hard. Everything left to the reader in the original paper does not affect correctness; it affects latency, throughput, and the complexity of the code. Once you understand what makes Paxos correct, which is mechanically simple, it is straightforward to write the rest of what you need in a way that does not violate consistency, while optimising the code for your use case and workload.
For example, Corfu and CURP provide very high performance: one uses Paxos only for metadata, while the other only needs to run Paxos when there are concurrent writes to the same keys. These solutions are not built directly on Raft or Multi-Paxos, because they target specific high-performance scenarios (e.g., a k-v store). What they demonstrate, and what is worth understanding, is that for real applications there are many optimisations available if your particular workload allows them, while still using Paxos for some parts of the overall solution.
This is mentioned in Paxos Made Simple:
Because of message loss, a value could be chosen with no learner ever finding out. The learner could ask the acceptors what proposals they have accepted, but failure of an acceptor could make it impossible to know whether or not a majority had accepted a particular proposal. In that case, learners will find out what value is chosen only when a new proposal is chosen. If a learner needs to know whether a value has been chosen, it can have a proposer issue a proposal, using the algorithm described above.
And in the Raft paper:
The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower.
If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log (if any).
If a follower or candidate crashes, then future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; if the crashed server restarts, then the RPC will complete successfully.
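The follower-side consistency check described in that passage can be sketched like this (a toy model, not code from any real Raft implementation):

```python
class Follower:
    """Toy Raft follower implementing the AppendEntries consistency check."""

    def __init__(self):
        self.log = []  # list of (term, command) entries

    def append_entries(self, prev_index, prev_term, entries):
        # Reject unless our log has an entry at prev_index whose term matches.
        if prev_index >= 0:
            if prev_index >= len(self.log) or self.log[prev_index][0] != prev_term:
                return False
        # Check passed: drop any conflicting suffix, append the leader's entries.
        self.log = self.log[: prev_index + 1] + list(entries)
        return True
```

Each rejection tells the leader to decrement nextIndex and retry; the first accepted call truncates any divergent suffix and brings the follower's log in line with the leader's.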