Replication

Leader-based Replication

How it works

Read

  • Clients can read data from either the leader or any of the followers

Write

  • Writes are only accepted on the leader
  • When the leader receives a write request, it first writes the data to its local storage, then sends the data change to all of its followers; each follower applies the changes in the same order as they were processed on the leader
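The write path above can be sketched as follows. This is a minimal in-memory illustration, not a real database API; the `Replica` and `Leader` classes are made up for this example.

```python
# Hypothetical sketch of leader-based replication: the leader applies each
# write locally, then forwards the same change, in order, to every follower.

class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

class Leader(Replica):
    def __init__(self, followers):
        super().__init__()
        self.followers = followers

    def write(self, key, value):
        self.apply(key, value)        # 1. write to local storage
        for f in self.followers:      # 2. send the change to all followers,
            f.apply(key, value)       #    who apply it in the same order

followers = [Replica(), Replica()]
leader = Leader(followers)
leader.write("x", 1)
```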

Real-world usage

  • Relational databases: PostgreSQL (since version 9.0), MySQL, Oracle Data Guard, SQL Server’s AlwaysOn Availability Groups…
  • Nonrelational databases: MongoDB, RethinkDB, Espresso…
  • Distributed message brokers: Kafka, RabbitMQ…
  • Network filesystems and replicated block devices: DRBD

Synchronous versus Asynchronous Replication

Asynchronous

  • The leader sends the message, but doesn’t wait for a response from the follower.

Synchronous

  • The leader waits until the follower has confirmed that it received the write before reporting success to the user, and before making the write visible to other clients.
  • Advantage: The follower is guaranteed to have an up-to-date copy of the data that is consistent with the leader. If the leader suddenly fails, we can be sure that the data is still available on the follower.
  • Disadvantage: If the synchronous follower does not respond (e.g. because it crashed or the network is down), the write cannot be processed: the leader must block all writes and wait until the follower is available again before it can report success to the user, make the write visible, and accept further writes. This can sometimes take a long time.

In practice: Semi-synchronous or completely asynchronous

  • In practice, enabling synchronous replication usually means that one of the followers is synchronous and the others are asynchronous (often called semi-synchronous). This guarantees that at least two nodes have an up-to-date copy of the data: the leader and the synchronous follower.
  • Often, leader-based replication is configured to be completely asynchronous

Setting up new followers

  • Take a consistent snapshot of the leader’s database at some point in time, if possible without locking the entire database
  • Copy the snapshot to the new follower node
  • The follower connects to the leader and requests all the data changes that have happened since the snapshot was taken.
  • When the follower has processed the backlog of data changes since the snapshot, we say it has caught up and it can now continue to process data changes from the leader as they happen.
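The four steps above can be sketched with a leader log of `(seq, key, value)` changes. All names here are illustrative, not any database's actual format.

```python
# Sketch of follower bootstrap: snapshot at a known log position, copy it,
# then replay only the changes that happened after the snapshot.

leader_log = [(1, "a", 1), (2, "b", 2), (3, "a", 3), (4, "c", 4)]

# 1. Take a consistent snapshot associated with a position in the log.
snapshot_pos = 2
snapshot = {}
for seq, key, value in leader_log:
    if seq <= snapshot_pos:
        snapshot[key] = value

# 2. Copy the snapshot to the new follower node.
follower = dict(snapshot)

# 3-4. Request and apply all changes since the snapshot; the follower has
# "caught up" once it reaches the end of the leader's log.
for seq, key, value in leader_log:
    if seq > snapshot_pos:
        follower[key] = value
```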

Handling Node Outages

Follower failure: Catch-up recovery

  • Each follower keeps a log of the data changes it has received from the leader. When a follower needs to recover from a crash or a network interruption, it can use this log to determine the last transaction it processed before the fault occurred, then connect to the leader and request all the data changes that occurred while it was disconnected
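A minimal sketch of this catch-up mechanism, assuming the follower tracks the sequence number of the last change it applied (the data layout is illustrative):

```python
# After a fault, the follower asks the leader for every change after the
# last sequence number it durably applied, then replays that backlog.

def changes_since(leader_log, last_applied):
    """Changes the follower missed while it was disconnected."""
    return [c for c in leader_log if c[0] > last_applied]

leader_log = [(1, "x", 10), (2, "y", 20), (3, "x", 30)]

last_applied = 1                 # follower crashed after applying seq 1
follower = {"x": 10}             # its state as of the crash

for seq, key, value in changes_since(leader_log, last_applied):
    follower[key] = value
    last_applied = seq           # remember progress for the next recovery
```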

Leader failure

  • Process
    1. Determining that the leader has failed: most systems simply use a timeout; if a node doesn’t respond for some period of time, it is assumed to be dead.
    2. Choosing a new leader
      • This could be done through an election process, or a new leader could be appointed by a previously elected controller node
    3. Reconfiguring the system to use the new leader: ensure that clients send their write requests to the new leader, and that if the old leader comes back, it becomes a follower and recognizes the new leader
  • Where it could go wrong

Implementation of Replication Logs

Statement-based replication

  • The leader logs every write request (statement) that it executes and sends that statement log to its followers
  • Drawbacks: statements that involve nondeterminism (e.g. NOW() or RAND()) may cause followers to compute different results from the leader. Although it is possible to work around this by replacing any nondeterministic function calls with a fixed value, there are so many edge cases that other replication methods are now generally preferred.
  • Real-world usage
    • MySQL used statement-based replication before version 5.1. It still sometimes uses it today, but it by default switches to row-based replication if there is any nondeterminism in a statement.
    • VoltDB uses statement-based replication, and makes it safe by requiring transactions to be deterministic.
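The nondeterminism drawback can be illustrated with a random value standing in for something like RAND() in a statement (this is an analogy, not actual replication code):

```python
import random

# Replaying the statement itself on each node: each node evaluates the
# nondeterministic call independently, so replicas can diverge.
statement = lambda: random.random()
leader_value = statement()
follower_value = statement()     # almost certainly differs from the leader

# The fix mentioned above: the leader substitutes the concrete value it
# actually computed, and ships that fixed value to followers instead.
replicated_value = leader_value
```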

Write-ahead log shipping

  • Most storage engines already use a write-ahead log: engines based on B-trees first write every modification to a write-ahead log for crash recovery, and in log-structured storage engines the log is the main place where data is stored. The leader can therefore send the exact same log to its followers, and each follower can build an exact copy of the leader’s data structures from it.
  • Disadvantages: the log is very low level, which couples replication closely to the storage engine. If the database changes its storage format from one version to another, it is typically not possible to run different versions of the database on the leader and the followers, so zero-downtime upgrades may not be possible.
  • Real-world usage: PostgreSQL, Oracle

Logical (row-based) log replication

  • Use a logical log that is decoupled from the storage engine internals. Each record can insert a row, delete or update a uniquely identified row, or indicate the commit of a transaction.
  • Advantages: more easily kept backward compatible, allowing the leader and the followers to run different versions of the database, or even different storage engines.
  • Real-world usage: MySQL’s binlog (when configured to use row-based replication).
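A sketch of what row-level logical log records might look like, and how a follower could apply them. The record format here is invented for illustration; it is only loosely inspired by row-based binlogs.

```python
# Each record describes a change at the granularity of a row, with no
# reference to storage-engine internals such as pages or byte offsets.
logical_log = [
    {"op": "insert", "table": "users", "row": {"id": 1, "name": "Ada"}},
    {"op": "update", "table": "users", "id": 1, "set": {"name": "Grace"}},
    {"op": "delete", "table": "users", "id": 1},
]

def apply_record(tables, record):
    rows = tables.setdefault(record["table"], {})
    if record["op"] == "insert":
        rows[record["row"]["id"]] = dict(record["row"])
    elif record["op"] == "update":
        rows[record["id"]].update(record["set"])
    elif record["op"] == "delete":
        del rows[record["id"]]

tables = {}
for rec in logical_log:
    apply_record(tables, rec)
```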

Trigger-based replication

Problems of Replication Lag

Reading your own writes

  • With asynchronous replication, a client may write data to the leader and then read from a follower shortly afterward, before the write has reached that follower. The client then thinks the write it just made was lost.
  • To avoid this problem, we need a guarantee that a client always sees any updates they submitted themselves. This is called read-after-write consistency or read-your-writes consistency
  • Implementations of read-your-writes consistency
    • When reading something that the user may have modified, read it from the leader; otherwise, read it from a follower. This won’t be effective if most things are potentially editable by the user.
    • Track the time of the last update, and for one minute after the last update, make all reads from the leader.
    • The client remembers the timestamp of its most recent write; if a replica is not sufficiently up to date for that timestamp, the read is redirected to another replica, or waits until the replica has caught up. The timestamp could be a logical timestamp or the actual system clock (in which case clock synchronization becomes critical).
    • If your replicas are distributed across multiple datacenters, any request that needs to be served by the leader must be routed to the datacenter that contains the leader.
  • Sometimes we may need cross-device read-after-write consistency
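The timestamp-based approach above can be sketched as follows, using a logical timestamp; the names (`applied_ts`, `pick_replica`) are made up for this example.

```python
# Read-your-writes via timestamps: the client remembers the timestamp of its
# last write and only reads from a replica that has caught up to it.

class Replica:
    def __init__(self, applied_ts):
        self.applied_ts = applied_ts   # last write timestamp applied here

def pick_replica(replicas, last_write_ts):
    """Return a replica at least as fresh as the client's last write, or
    None if every replica is still lagging (the caller would then wait,
    or fall back to reading from the leader)."""
    for r in replicas:
        if r.applied_ts >= last_write_ts:
            return r
    return None

replicas = [Replica(applied_ts=5), Replica(applied_ts=9)]
fresh = pick_replica(replicas, last_write_ts=7)   # only the second qualifies
stale = pick_replica(replicas, last_write_ts=12)  # nobody has caught up yet
```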

Monotonic reads

  • Problem: a user sees things moving backward in time. This happens when the user makes the same query twice against different replicas, and the second replica has a greater lag than the first one.
  • Monotonic reads: a guarantee that if one user makes several reads in sequence, they will not read older data after having previously read newer data.
  • Implementation: make sure that each user always makes their reads from the same replica (For example, the replica can be chosen based on a hash of the user ID)
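The hash-based routing rule can be sketched in a few lines (the function name is illustrative):

```python
import hashlib

def replica_for(user_id, n_replicas):
    """Map a user to a fixed replica, so all of that user's reads go to
    the same node and are therefore monotonic."""
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return int(digest, 16) % n_replicas
```

Note this only helps while the chosen replica stays reachable; if it fails, the user's reads must be rerouted and the guarantee has to be re-established another way.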

Consistent prefix reads

Solutions

Multi-leader replication

Leaderless Replication

Basics

Client writes to and reads from multiple replicas

  • Client sends the write to multiple replicas in parallel, and once a specified number of replicas has responded ok, we consider the write to be successful.
  • Client reads from multiple replicas, determines the newest data by version numbers, and uses the newest data

Quorum reads and writes

  • If there are n replicas, every write must be confirmed by w nodes to be considered successful, and we must query at least r nodes for each read. As long as w + r > n, the set of nodes we read from must overlap with the set of nodes the write went to, so at least one of the r nodes returns an up-to-date value
  • The values of n, w, and r are typically configurable. A common choice is to make n an odd number and set w = r = (n + 1) / 2 (rounded up).
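The quorum condition and the common configuration above, as a worked check:

```python
import math

def quorum_ok(n, w, r):
    """w + r > n guarantees every read quorum overlaps every write quorum
    (by pigeonhole: w + r nodes cannot all be distinct among n)."""
    return w + r > n

def common_choice(n):
    """The common configuration: w = r = (n + 1) / 2, rounded up."""
    w = r = math.ceil((n + 1) / 2)
    return w, r

w, r = common_choice(5)   # n = 5 gives w = r = 3, and 3 + 3 > 5
```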

Read repair and anti-entropy process

  • Read repair: when a client reads from several replicas in parallel, it can detect stale responses and write the newer value back to any replica that returned stale data
  • Anti-entropy: A background process constantly looks for differences between replicas and copies any missing data from one replica to another. Unlike the replication log in leader-based replica, this process does not copy writes in any particular order, and there may be a significant delay before data is copied.
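Read repair can be sketched as follows, modeling each replica as a map from key to a `(version, value)` pair; the layout is illustrative, not any database's wire format.

```python
# The client reads versioned values from several replicas, keeps the copy
# with the highest version number, and writes it back to stale replicas.

def read_with_repair(replicas, key):
    responses = [(r, r.get(key, (0, None))) for r in replicas]
    newest = max(v for _, v in responses)      # compare (version, value)
    for replica, value in responses:
        if value < newest:
            replica[key] = newest              # repair the stale replica
    return newest[1]

a = {"k": (2, "new")}
b = {"k": (1, "old")}                          # b is lagging behind a
value = read_with_repair([a, b], "k")
```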

Limitations

Sloppy Quorums and Hinted Handoff

Multi-datacenter operation

Treat all nodes across all datacenters as n replicas

  • Implement multi-datacenter support within the normal leaderless model: the number of replicas n includes nodes in all datacenters
  • Each write from a client is sent to all replicas, regardless of datacenters.
  • The client usually only waits for local datacenter and writes to other datacenters are often configured to happen asynchronously
  • Examples: Cassandra, Voldemort

Treat nodes within one datacenter as n replicas

  • Keep all communication between clients and database nodes local to one datacenter, so n describes the number of replicas within one datacenter.
  • Cross-datacenter replication between database clusters happens asynchronously in the background, in a style that is similar to multi-leader replication.
  • Examples: Riak

Detecting Concurrent Writes

Examples

  • Dynamo, Riak, Cassandra, Voldemort