Consistency Guarantees

Linearizability

Definition

Terminology

  • Also known as atomic consistency, strong consistency, immediate consistency, or external consistency

Motivation

  • Gives the illusion that there is only one replica of the data

Definition

  • In a linearizable system, we imagine that there must be some point in time (between the start and end of the write operation) at which the value of a register x atomically flips from the old value to the new one. Thus, if one client’s read returns the new value, all subsequent reads must also return the new value, even if the write operation has not yet completed.
  • This model doesn’t assume any transaction isolation: another client may change a value at any time. An atomic compare-and-set (CAS) operation can be used to check that the value hasn’t been concurrently changed by another client (see the sketch below).
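
A minimal sketch of compare-and-set on a single-copy register. The `LinearizableRegister` class below is illustrative only; a thread lock stands in for the single-replica illusion that a real linearizable system provides across nodes:

```python
import threading

class LinearizableRegister:
    """Illustrative single-copy register: the lock makes each operation
    atomic, standing in for the "only one replica" illusion."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def compare_and_set(self, expected, new):
        """Set to `new` only if the current value is still `expected`.
        Returns False if another client changed the value concurrently."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

# Two clients racing to flip x from 0 to 1: only one succeeds.
x = LinearizableRegister(0)
print(x.compare_and_set(0, 1))  # True: the value flips atomically
print(x.compare_and_set(0, 2))  # False: the value is already 1
```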

Linearizability vs Serializability

  • Serializability is an isolation property of transactions, which guarantees that transactions behave the same as if they had executed in some serial order
  • Linearizability is a recency guarantee on reads and writes of a register (an individual object). It doesn’t group operations together into transactions
  • A database may provide both serializability and linearizability, and this combination is known as strict serializability or strong one-copy serializability. Implementations of serializability based on two-phase locking or actual serial execution are typically linearizable. However, serializable snapshot isolation is not linearizable.

Relying on Linearizability

Locking and leader election

  • A system that uses single-leader replication needs to ensure that there is indeed only one leader. One way to elect a leader is to use a lock: every node that starts up tries to acquire the lock, and the one that succeeds becomes the leader. No matter how this lock is implemented, it must be linearizable (see the sketch after this list).
  • Coordination services like Apache ZooKeeper and etcd are often used to implement distributed locks and leader election. They use consensus algorithms to implement linearizable operations in a fault-tolerant way (see Fault-Tolerant Consensus)
  • Distributed locking is also used at a much more granular level in some distributed databases, such as Oracle Real Application Clusters (RAC). RAC uses a lock per disk page, with multiple nodes sharing access to the same disk storage system.
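
A rough sketch of lock-based leader election. The `coord` client and its `acquire_lock`/`renew` interface are hypothetical (not a real ZooKeeper or etcd API); they only illustrate the pattern:

```python
import time

def run_node(node_id, coord):
    """coord: hypothetical client for a linearizable coordination service
    (something built on ZooKeeper or etcd); the interface is illustrative."""
    while True:
        # The lock acquisition must be linearizable: every node must agree
        # on who holds it, or two nodes could both act as leader.
        lease = coord.acquire_lock("leader", ttl_seconds=10)
        if lease is not None:
            # Lead only while the lease keeps renewing; losing the lease
            # means another node may already have been elected.
            while lease.renew():
                print(f"node {node_id}: acting as leader")
                time.sleep(1)
        else:
            print(f"node {node_id}: follower, waiting")
            time.sleep(1)
```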

Constraints and uniqueness guarantees

  • Uniqueness: Uniqueness constraints are common in databases, for example for usernames and filenames. If you want to enforce such a constraint as the data is written, you need linearizability (see the sketch after this list).
  • Constraints: Some systems require constraints that must always hold, for example that a bank account balance never goes negative. Such constraints require there to be a single up-to-date value that all nodes agree on. In real applications it is sometimes acceptable to treat such constraints loosely, in which case linearizability may not be needed. However, a hard uniqueness constraint requires linearizability.
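
A sketch of enforcing a hard uniqueness constraint (e.g., usernames) with a linearizable put-if-absent. The `LinearizableMap` class is illustrative, with a thread lock again standing in for linearizability:

```python
import threading

class LinearizableMap:
    """Illustrative single-copy key-value store; the lock makes
    put_if_absent atomic across concurrent clients."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key, value):
        """Atomically claim `key` only if nobody holds it yet."""
        with self._lock:
            if key in self._data:
                return False  # someone else registered it first
            self._data[key] = value
            return True

usernames = LinearizableMap()
print(usernames.put_if_absent("alice", "user-1"))  # True: claim succeeds
print(usernames.put_if_absent("alice", "user-2"))  # False: already taken
```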

Cross-channel timing dependencies

  • When there are multiple communication channels (for example, an image uploader has two channels: storing the image in file storage and sending a resizing job through a message queue), linearizability is required to prevent race conditions (for example, the resizer may see an old version of the image, or nothing at all, when it receives the resizing job). A sketch of the race follows.
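
A sketch of the two channels and where the race arises; the `storage` and `queue` clients are hypothetical stand-ins:

```python
# Hypothetical two-channel flow: file storage plus a message queue.

def handle_upload(image_id, image_bytes, storage, queue):
    # Channel 1: write the full-size image to file storage.
    storage.put(f"images/{image_id}/original", image_bytes)
    # Channel 2: notify the resizer via a message queue.
    queue.send({"type": "resize", "image_id": image_id})

def resizer_worker(storage, queue):
    job = queue.receive()
    # Without linearizability, this read may hit a replica that has not
    # yet seen the write from channel 1: the resizer gets an old image,
    # or nothing at all, even though the queue message arrived "after"
    # the write from the uploader's point of view.
    original = storage.get(f"images/{job['image_id']}/original")
    if original is None:
        raise RuntimeError("race: job arrived before the image was visible")
```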

Implementation

Single-leader replication (potentially linearizable)

  • If you make reads from the leader, or from synchronously updated followers, they have the potential to be linearizable.
  • However, not every single-leader database is actually linearizable, either by design (e.g., because it uses snapshot isolation) or due to concurrency bugs
  • Using the leader for reads relies on the assumption that you know for sure who the leader is. It is quite possible for a node to think that it is the leader when in fact it is not. If the delusional leader continues to serve requests, it is likely to violate linearizability.
  • With asynchronous replication, failover may even lose committed writes, which violates both durability and linearizability.

Consensus algorithms (linearizable)

  • Some consensus algorithms bear a resemblance to single-leader replication. However, consensus protocols contain measures to prevent split brain and stale replicas.
  • ZooKeeper and etcd can guarantee linearizability by using consensus algorithms.

Multi-leader replication (not linearizable)

  • Systems with multi-leader replication are generally not linearizable, because they concurrently process writes on multiple nodes and asynchronously replicate them to other nodes. For this reason, they can produce conflicting writes that require resolution. Such conflicts are an artifact of the lack of a single copy of the data.

Cost

Ordering Guarantees

Causality

Sequence Number Ordering

Total order broadcast

Distributed Transaction and Consensus

Two-phase Commit transaction

Problems of distributed atomic commit

  • It is not sufficient to simply send a commit request to all of the nodes and have each node independently commit the transaction. If we did so, the commit could succeed on some nodes and fail on others, which would violate the atomicity guarantee.

Two-phase commit

  • Uses an external coordinator; the database nodes are called participants.
  • When the application is ready to commit, the coordinator begins phase 1: it sends a prepare request to each of the participants, asking whether they are able to commit. If all participants reply “yes”, the coordinator sends out a commit request in phase 2; if any participant replies “no”, the coordinator sends an abort request to all participants in phase 2. A coordinator-side sketch follows.
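
A minimal coordinator-side sketch of 2PC. The participant clients and the durable `log` are hypothetical stand-ins, and a real coordinator must also retry phase-2 requests forever after the commit point (see the points below):

```python
# Minimal 2PC coordinator sketch (illustrative interfaces only).

def two_phase_commit(txid, participants, log):
    # Phase 1: ask every participant whether it can commit.
    votes = [p.prepare(txid) for p in participants]

    # Decide, and make the decision durable *before* telling anyone.
    decision = "commit" if all(votes) else "abort"
    log.append(txid, decision)  # the commit point

    # Phase 2: broadcast the decision. After the commit point there is
    # no going back, so a real coordinator retries failed sends forever.
    for p in participants:
        if decision == "commit":
            p.commit(txid)
        else:
            p.abort(txid)
    return decision
```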

Important points

  • When a participant receives the prepare request, it makes sure that it can definitely commit the transaction under all circumstances. This includes writing all transaction data to disk and checking for any conflicts or constraint violations. By replying “yes” to the coordinator, the participant promises to commit the transaction without error if requested.
  • When the coordinator has received responses to all prepare requests, it makes a definitive decision on whether to commit or abort the transaction. The coordinator must write that decision to its transaction log on disk so that it knows which way it decided in case it subsequently crashes. This is called the commit point.
  • Once the coordinator’s decision has been written to disk, the commit or abort request is sent to all participants. If this request fails or times out, the coordinator must retry forever until it succeeds; there is no going back.
  • Thus, the protocol contains two crucial “points of no return”: when a participant votes “yes,” it promises that it will definitely be able to commit later (although the coordinator may still choose to abort); and once the coordinator decides, that decision is irrevocable. These promises ensure the atomicity of 2PC. A participant-side sketch follows.
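
A matching participant-side sketch, pairing with the coordinator sketch above. The `wal` write-ahead log object is a hypothetical stand-in; the point illustrated is the first point of no return at the “yes” vote:

```python
# Minimal 2PC participant sketch (illustrative interfaces only).

class Participant:
    def __init__(self, wal):
        self.wal = wal
        self.prepared = set()  # transactions we have promised to commit

    def prepare(self, txid):
        # Before voting "yes", write everything needed to commit to disk
        # and check constraints, so the commit can later succeed under
        # all circumstances (even after a crash).
        try:
            self.wal.flush(txid)
            self.check_constraints(txid)
        except Exception:
            return False  # vote "no": the coordinator must abort
        self.prepared.add(txid)
        return True       # vote "yes": we may no longer abort unilaterally

    def commit(self, txid):
        # Must succeed: we promised at prepare time. Idempotent so the
        # coordinator can safely retry after a timeout.
        if txid in self.prepared:
            self.wal.mark_committed(txid)
            self.prepared.discard(txid)

    def abort(self, txid):
        self.wal.discard(txid)
        self.prepared.discard(txid)

    def check_constraints(self, txid):
        pass  # placeholder for conflict / constraint checks
```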

Unique transaction ID

  • When the application wants to begin a distributed transaction, it requests a transaction ID from the coordinator; this ID globally and uniquely identifies the transaction.
  • When the coordinator and the participants communicate, they must attach the transaction ID.

Coordinator failure

  • If the coordinator fails before sending the prepare requests, a participant can safely abort the transaction. But once the participant has received a prepare request and voted “yes,” it can no longer abort unilaterally: it must wait to hear back from the coordinator whether the transaction was committed or aborted. Thus, if the coordinator crashes, the only way to make progress is to wait for the coordinator to recover.
  • In principle, the participants could communicate among themselves to find out how each participant voted and come to some agreement, but that is not part of the 2PC protocol.

Three-phase commit and non-blocking atomic commit

  • Two-phase commit is called a blocking atomic commit protocol because it can become stuck waiting for the coordinator to recover.
  • A nonblocking algorithm called three-phase commit has been proposed, but it assumes a network with bounded delay and nodes with bounded response times, which is not guaranteed in most practical systems.
  • In general, nonblocking atomic commit requires a perfect failure detector, i.e., a reliable mechanism for telling whether a node has crashed or not. In a network with unbounded delay, a timeout is not a reliable failure detector. For this reason, 2PC continues to be used.

Distributed Transactions in Practice

Problems with Distributed Transactions

  • The additional disk forcing (fsync) that two-phase commit requires for crash recovery and the additional network round-trips carry a heavy performance penalty. Distributed transactions in MySQL are reported to be over 10 times slower than single-node transactions.

Homogeneous and heterogeneous distributed transactions

  • Some distributed databases (i.e., databases that use replication and partitioning in their standard configuration) support internal transactions among the nodes of that database.
  • In a heterogeneous transaction, the participants are two or more different technologies.
  • Database-internal distributed transactions can use any protocol and apply optimizations specific to that particular storage engine, so they often work quite well. On the other hand, transactions spanning heterogeneous technologies are a lot more challenging.

XA Transactions

Fault-Tolerant Consensus

Properties

  • Uniform agreement: No two nodes decide differently; every node decides on the same outcome.
  • Integrity: No node decides twice. Once a node has decided, it cannot change its mind.
  • Validity: If a node decides value v, then v was proposed by some node. This is used to rule out a trivial algorithm that always decides null no matter what was proposed.
  • Termination: Every node that does not crash eventually decides some value.

Fault-tolerance

  • Satisfying the first three properties alone is easy: we can just make one node the “dictator” and let that node make all decisions. But then it is not fault-tolerant: if the “dictator” crashes, the system can no longer make any progress.
  • The termination property formalizes the idea of fault tolerance. Even if some nodes fail, the other nodes must still reach a decision.
  • If all nodes crash, it is impossible for them to decide anything, so there is a limit to the number of failures an algorithm can tolerate. It can be proved that any consensus algorithm requires at least a majority of nodes to be functioning correctly in order to assure termination (see the worked example after this list).
  • 2PC depends on a single coordinator and thus is not fault-tolerant.
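
A small worked example of the majority requirement: with n nodes, progress needs the surviving n − f nodes to form a majority, i.e. n − f > n/2, which gives n ≥ 2f + 1:

```python
def max_tolerated_failures(n):
    """Largest f such that a majority (more than n/2 nodes) still survives."""
    return (n - 1) // 2

for n in [1, 3, 4, 5, 7]:
    print(f"{n} nodes -> tolerates {max_tolerated_failures(n)} failure(s)")
# 3 nodes tolerate 1 failure, 5 tolerate 2. A cluster of 4 still only
# tolerates 1, which is why odd cluster sizes are common in practice.
```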

Consensus algorithms and total order broadcast

  • Most consensus algorithms do not use the formal model of proposing and deciding on a single value; instead they decide on a sequence of values, which makes them total order broadcast algorithms.
  • Total order broadcast requires messages to be delivered exactly once, in the same order, to all nodes. This is equivalent to performing several rounds of consensus, one per position in the sequence (see the sketch below).
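
A very rough sketch of the reduction, assuming a hypothetical `decide(slot, proposal)` single-value consensus primitive that returns the same winner on every node. Real protocols also handle retries, gaps, batching, and leader changes:

```python
def broadcast(decide, deliver, my_messages):
    """Deliver my_messages (and everyone else's) in one agreed total order."""
    pending = list(my_messages)
    slot = 0
    while pending:
        # Propose my oldest undelivered message for the next slot; consensus
        # may pick another node's message instead, in which case I retry
        # mine for a later slot.
        chosen = decide(slot, pending[0])
        deliver(slot, chosen)
        if chosen == pending[0]:
            pending.pop(0)
        slot += 1
    # A real node keeps looping even when it has nothing left to send,
    # so that it continues delivering other nodes' messages.
```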

Leader election, epoch numbering, and quorums

  • Most consensus algorithms still use some kind of leader. However, instead of relying on a manually appointed leader (which, like the dictator above, would sacrifice fault tolerance), they implement automatic leader election and failover.
  • However, electing a leader is itself a decision that the nodes must agree on, which looks like a chicken-and-egg problem. Consensus algorithms solve it by using a weaker form of leader uniqueness based on an epoch number (called the ballot number in Paxos, the view number in Viewstamped Replication, and the term number in Raft). Epoch numbers are monotonically increasing, and each epoch is guaranteed to have at most one leader. If there is a conflict between two leaders in two different epochs, the leader with the higher epoch number prevails.
  • Before a leader is allowed to decide anything, it must first check that there isn’t some other leader with a higher epoch number which might take a conflicting decision. It does this by collecting votes from a quorum of nodes; the quorum typically, but not always, consists of a majority of nodes. The key insight is that the quorum for this vote must overlap with the quorum of the most recent leader election: if a vote on a proposal succeeds, at least one of the nodes that voted for it must also have participated in the most recent leader election. If no higher epoch number is revealed during this vote, the current leader can conclude that no election with a higher epoch number has happened, and therefore it still holds the leadership and can safely decide the proposed value. A sketch of this check follows.
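
A sketch of the epoch/quorum check, with a hypothetical `Node` voting interface. Real protocols (Raft, Multi-Paxos, Zab) differ in many details:

```python
class Node:
    def __init__(self):
        self.promised_epoch = 0  # highest epoch this node has voted for

    def vote(self, epoch):
        # Vote only if this epoch is at least as new as any already promised.
        if epoch >= self.promised_epoch:
            self.promised_epoch = epoch
            return True
        return False

def leader_decide(epoch, value, nodes):
    """Decide `value` only if a quorum still accepts this epoch."""
    votes = sum(1 for n in nodes if n.vote(epoch))
    if votes > len(nodes) // 2:
        # Quorums overlap: any later election with a higher epoch must
        # include at least one node that voted here, so a stale leader's
        # proposal would be rejected instead of reaching this point.
        return ("decided", value)
    return ("rejected", None)

nodes = [Node() for _ in range(5)]
print(leader_decide(1, "x=1", nodes))  # ('decided', 'x=1')
print(leader_decide(2, "x=2", nodes))  # newer epoch also succeeds
print(leader_decide(1, "x=3", nodes))  # ('rejected', None): stale epoch
```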

Limitations of consensus

  • The process by which nodes vote on proposals before they are decided is a kind of synchronous replication, but databases are often configured to use asynchronous replication for the sake of better performance.
  • Consensus systems always require a strict majority to operate. If a network failure cuts off some nodes from the rest, only the majority portion of the network can make progress, and the rest is blocked.
  • Most consensus algorithms assume a fixed set of nodes that participate in voting, which means that you can’t just add or remove nodes in the cluster. Dynamic membership extensions to consensus algorithms allow the set of nodes in the cluster to change over time, but they are much less well understood than static membership algorithms.
  • Consensus systems generally rely on timeouts to detect failed nodes. In environments with highly variable network delays, a node may falsely believe the leader to have failed because of a transient network issue, and frequent leader elections can result in terrible performance.

Real-world cases

  • Viewstamped Replication: implements total order broadcast
  • Raft: implements total order broadcast
  • Zab: implements total order broadcast
  • Paxos: an optimization known as Multi-Paxos implements total order broadcast

Membership and coordination services

Important features

  • Linearizable atomic operations: Using an atomic compare-and-set operation, you can implement a lock: if several nodes concurrently try to perform the same operation, only one of them will succeed. The consensus protocol guarantees that the operation will be atomic and linearizable, even if a node fails or the network is interrupted at any point. A distributed lock is usually implemented as a lease, which has an expiry time so that it is eventually released in case the client fails (see the sketch below).
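
A sketch of a lease-based lock built on an atomic acquire-if-free-or-expired check. The `LeaseStore` class is illustrative only; real systems use mechanisms such as ZooKeeper ephemeral nodes or etcd leases:

```python
import time
import threading

class LeaseStore:
    """Illustrative lease store; the mutex stands in for the linearizable
    compare-and-set that a coordination service would provide."""

    def __init__(self):
        self._data = {}   # lock name -> (owner, expiry timestamp)
        self._mutex = threading.Lock()

    def try_acquire(self, name, owner, ttl):
        """Atomically take the lock if it is free or its lease has expired."""
        with self._mutex:
            now = time.time()
            holder = self._data.get(name)
            if holder is None or holder[1] < now:
                self._data[name] = (owner, now + ttl)
                return True
            return False

store = LeaseStore()
print(store.try_acquire("leader", "node-1", ttl=10))  # True: node-1 wins
print(store.try_acquire("leader", "node-2", ttl=10))  # False: held by node-1
# Once the lease expires (e.g., node-1 crashed), node-2 can take over.
```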