Commit Failure in Paxos

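I am new to distributed systems and consensus algorithms. I understand how Paxos works, but I am confused by some corner cases: how will the acceptors react when they have received an ACCEPT for an instance but never hear back about what the final consensus or decision is? For example, the proposer reboots or fails during commit, or right after it sends all the ACCEPTs. What will happen in this case?

Thanks.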

There are two parts to this question: how do the acceptors react to new proposals, and how do acceptors react if they never learn the result?

In plain-old paxos, the acceptors never actually need to know the result. In fact it is perfectly reasonable that different acceptors have different values in their memory, never knowing if the value they have is the committed value.

The real point of paxos is to deal with the first question. And seeing that the acceptor never actually knows if it has the committed value, it has to assume that it could have the committed value but be open to replacing its value if it doesn't. How does it know? When receiving a message, the acceptor always compares the round number, and if that is old, the acceptor signals to the proposer that it has to "catch up" first (a Nack). Otherwise, it trusts that the proposer knows what it is doing.
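As a rough illustration of that round-number check, here is a minimal single-decree acceptor sketch in Python. The class name, field names, and reply tuples are all hypothetical choices made for this example, and it assumes round numbers are unique per proposer and start at 0 or above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Acceptor:
    # Hypothetical state layout; -1 means "nothing seen yet".
    promised: int = -1                    # highest round promised or accepted so far
    accepted_round: int = -1              # round in which a value was last accepted
    accepted_value: Optional[str] = None  # may or may not be the committed value

    def on_prepare(self, rnd: int) -> Tuple[str, int, Optional[str]]:
        """Phase 1: promise only if the round is newer than anything seen."""
        if rnd <= self.promised:
            # Old round: tell the proposer which round it must catch up to.
            return ("nack", self.promised, None)
        self.promised = rnd
        # Report whatever was accepted earlier so the proposer can adopt it.
        return ("promise", self.accepted_round, self.accepted_value)

    def on_accept(self, rnd: int, value: str) -> Tuple[str, int]:
        """Phase 2: accept unless a newer round has already been promised."""
        if rnd < self.promised:
            return ("nack", self.promised)
        # The acceptor cannot tell whether its old value was committed;
        # it simply trusts that a proposer with a current round chose correctly.
        self.promised = rnd
        self.accepted_round = rnd
        self.accepted_value = value
        return ("accepted", rnd)
```

Note that the acceptor never asks whether its stored value was committed. It only compares round numbers, which is why it needs no special handling when the proposer disappears.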


Now for a word about real systems. Some real paxos systems can get away with the acceptors not caring what the committed value is: Paxos is just there to choose what the value will be. But many real systems use Paxos & Friends to make redundant copies of the data for safekeeping.

Some paxos systems will continue paxos-ing until all the acceptors have the data. (Notice that without interference from other proposers, an extra paxos round copies the committed value everywhere.) Other systems are wary about interference from other proposers and will use a different Committed message that teaches the acceptors (and other Learners) what the committed value is.

But what happens if the proposer crashes? A subsequent proposer can come along and propose a no-op value. If the subsequent proposer Prepares (Phase 1A) and can communicate with ANY of the acceptors that the prior proposer successfully sent Accepts to (Phase 2A) then it will know what the prior proposer was trying to do (via the response in Phase 1B: PrepareAck). Otherwise a harmless no-op value gets committed.
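The recovery rule described above can be sketched as a small function. This is only an illustration: `choose_value`, `NO_OP`, and the `(accepted_round, accepted_value)` tuple shape for Phase 1B replies are assumptions made for this example, with `-1` and `None` standing for "nothing accepted yet".

```python
from typing import List, Optional, Tuple

NO_OP = "no-op"  # hypothetical placeholder for a harmless value


def choose_value(
    promises: List[Tuple[int, Optional[str]]],
    cluster_size: int,
) -> Optional[str]:
    """Pick the value a recovering proposer should send in Phase 2A.

    `promises` holds one (accepted_round, accepted_value) pair per
    acceptor that answered the Prepare with a PrepareAck.
    Returns None if no majority answered, so the proposer cannot proceed yet.
    """
    if len(promises) <= cluster_size // 2:
        return None
    # Adopt the value accepted in the highest round, if any acceptor reports one.
    _, best_value = max(promises, key=lambda p: p[0])
    return best_value if best_value is not None else NO_OP
```

For instance, `choose_value([(-1, None), (3, "set x=1"), (-1, None)], 5)` adopts `"set x=1"` from the one acceptor the prior proposer reached, while three empty replies yield the no-op.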

when the acceptors received an ACCEPT for an instance but never heard back about what the final consensus or decision is, [how] will the acceptors react.

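The node sending the value usually learns that its value is fixed by counting positive responses to its ACCEPT messages until it sees a majority. If messages are dropped, they can be resent until enough messages get through to determine a majority outcome. The acceptors don't need to do anything apart from accurately following the algorithm when repeated messages are sent.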
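A minimal sketch of that counting-and-resending loop might look like the following. The function name, the `send_accept` callback (which returns True only when a positive reply comes back), and the `max_attempts` limit are all assumptions for illustration; a real system would use timeouts and asynchronous messaging.

```python
from typing import Callable, Sequence, Set


def run_accept_phase(
    acceptor_ids: Sequence[str],
    send_accept: Callable[[str], bool],
    max_attempts: int = 5,
) -> bool:
    """Resend ACCEPT to silent acceptors until a majority has replied positively."""
    majority = len(acceptor_ids) // 2 + 1
    acked: Set[str] = set()  # a set, so duplicate replies are never double counted
    for _ in range(max_attempts):
        for node in acceptor_ids:
            # Only resend to acceptors we have not heard a positive reply from.
            if node not in acked and send_accept(node):
                acked.add(node)
        if len(acked) >= majority:
            return True  # the value is fixed, and this node now knows it
    return False  # outcome unknown: the value may or may not be fixed
```

A False result does not mean the value was rejected. It only means this node did not learn the outcome, which is exactly the situation the question asks about.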

For example, the proposer is rebooting or failed during commit or right after it sends all the ACCEPT. What will happen in this case?

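That is indeed an interesting case. A value might be accepted by a majority, and so be fixed, but no one knows it because all of the intended messages failed to arrive.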

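Responses to PREPARE messages contain information about accepted values. So any node can issue PREPARE messages and learn whether a value has been fixed. That is really the genius of Paxos. Once a value has been accepted by a majority it is fixed, because any node running the algorithm must keep choosing that same value under every message-loss and crash scenario.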
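One way to see why a majority-accepted value cannot be missed is that any two majorities overlap. The brute-force check below is a small sketch of that argument; the function name and the node labels are made up for this example.

```python
from itertools import combinations
from typing import List, Set


def every_majority_sees(holders: Set[str], all_acceptors: List[str]) -> bool:
    """True if every possible majority contains at least one acceptor in `holders`.

    `holders` is the set of acceptors that accepted the value in question.
    """
    majority = len(all_acceptors) // 2 + 1
    return all(
        holders & set(quorum)  # non-empty intersection means the value is reported
        for quorum in combinations(all_acceptors, majority)
    )


nodes = ["a", "b", "c", "d", "e"]
# Accepted by a majority (3 of 5): any PREPARE answered by a majority reveals it.
fixed = every_majority_sees({"a", "b", "c"}, nodes)
# Accepted by only 2 of 5: the majority {c, d, e} would never report it.
not_fixed = every_majority_sees({"a", "b"}, nodes)
```

Here `fixed` is True and `not_fixed` is False, which matches the claim that a value only becomes fixed once a majority has accepted it.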

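Typically there is a leader who streams ACCEPT messages for successive rounds with successive values. If the leader crashes, any node can time out and attempt to lead by sending PREPARE messages. Multiple nodes issuing PREPARE messages in an attempt to lead can interrupt one another, which gives a livelock. Yet once a value has been fixed, they can never disagree about what it is. They can only compete to get their own value fixed until enough messages get through to produce a winner.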

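Once again, when a new leader takes over from a crashed leader, the acceptor nodes don't need to do anything other than follow the algorithm. The invariants of the algorithm mean that no leader will contradict any previous leader about a fixed value. New leaders collaborate with old leaders, and the acceptors can simply trust that this is the case. Eventually, enough messages will get through for all nodes to learn the result.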