Case Studies
GFS
Why
- performance -> sharding
- faults -> tolerance -> replication
- replication -> inconsistency
- consistency -> low performance
Strong consistency
Architecture
- A GFS cluster consists of a single master and multiple chunkservers
- Files are divided by chunks. Each chunk is identified by an immutable and globally unique 64 bit chunk handle assigned by the master at the time of chunk creation.
- Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range.
- For reliability, each chunk is replicated on multiple chunkservers. By default, we store three replicas, though users can designate different replication levels for different regions of the file namespace
Master Data
- file name -> array of chunk handles (non-volatile)
- chunk handle ->
    - list of chunkservers (volatile)
    - version number (non-volatile)
    - primary (volatile)
    - lease expiration (volatile)
- log, checkpoint (on disk)
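A minimal Go sketch of how the master might lay out these tables; the type and field names are illustrative assumptions (GFS itself is C++), but the volatile/non-volatile split follows the list above.

```go
package gfsmaster

import "time"

// ChunkHandle is the immutable, globally unique 64-bit id assigned by the
// master when the chunk is created.
type ChunkHandle uint64

// chunkInfo is the per-chunk state the master keeps in memory. Replica
// locations, the primary, and the lease are volatile: they are rebuilt by
// asking chunkservers at startup rather than being written to disk.
type chunkInfo struct {
	replicas    []string  // addresses of chunkservers holding a replica (volatile)
	version     uint64    // chunk version number (non-volatile)
	primary     string    // current primary chunkserver, "" if no lease (volatile)
	leaseExpiry time.Time // when the primary's lease expires (volatile)
}

// master holds the two core tables; mutations to the file-name table and to
// version numbers are appended to the on-disk log (with periodic checkpoints)
// before they are applied in memory.
type master struct {
	files  map[string][]ChunkHandle // file name -> ordered array of chunk handles (non-volatile)
	chunks map[ChunkHandle]*chunkInfo
}
```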
Read
- client sends the filename and offset to the master
- master sends back handle, list of chunkservers
- client talks to one of the chunkservers
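A minimal Go sketch of that read path. The MasterStub/ChunkserverStub interfaces, the RPC names Lookup and ReadChunk, and the use of a 64 MB chunk size to turn a byte offset into a chunk index are assumptions for illustration, not the actual GFS wire protocol.

```go
package gfsclient

// Hypothetical stubs standing in for the real GFS RPCs.
type MasterStub interface {
	// Lookup returns the chunk handle and replica locations for the chunk
	// at the given index of the file.
	Lookup(file string, chunkIndex int64) (handle uint64, chunkservers []string, err error)
}

type ChunkserverStub interface {
	ReadChunk(handle uint64, offset, length int64) ([]byte, error)
}

const chunkSize = 64 << 20 // GFS uses 64 MB chunks

// Read shows the three-step flow: ask the master which chunk covers the
// offset and where its replicas live, then fetch the bytes from one of the
// chunkservers (clients normally cache the master's reply).
func Read(master MasterStub, dial func(addr string) ChunkserverStub,
	file string, offset, length int64) ([]byte, error) {

	handle, servers, err := master.Lookup(file, offset/chunkSize)
	if err != nil {
		return nil, err
	}
	cs := dial(servers[0]) // any replica will do; GFS clients prefer a nearby one
	return cs.ReadChunk(handle, offset%chunkSize, length)
}
```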
Raft
Algorithm
State
- Persistent state on all servers
- currentTerm: latest term server has seen (initialized to 0 on first boot, increases monotonically)
- votedFor: candidateId that received vote in current term (or null if none)
- log[]: log entries: each entry contains command for state machine, and term when entry was received by leader (first index is 1)
- Volatile state on all servers:
- commitIndex: index of highest log entry known to be committed (initialized to 0, increases monotonically)
- lastApplied: index of highest log entry applied to state machine
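The same state written as a Go struct (the layout used by the later Raft sketches in this section). Field names follow Figure 2 of the Raft paper; the sentinel entry at log index 0 and the leader-only nextIndex/matchIndex fields are included here for the sketches below.

```go
package raft

// LogEntry holds a command for the state machine together with the term in
// which the leader received it.
type LogEntry struct {
	Term    int
	Command interface{}
}

// Raft holds the per-server state from Figure 2.
type Raft struct {
	// Persistent state on all servers (saved to stable storage before
	// responding to RPCs).
	currentTerm int        // latest term this server has seen, starts at 0
	votedFor    int        // candidateId voted for in currentTerm, or -1 for "null"
	log         []LogEntry // log[0] is a sentinel with Term 0, so real entries start at index 1

	// Volatile state on all servers.
	commitIndex int // highest log index known to be committed, starts at 0
	lastApplied int // highest log index applied to the state machine, starts at 0

	// Volatile state on leaders (reinitialized after each election).
	nextIndex  []int // for each peer, index of the next log entry to send
	matchIndex []int // for each peer, highest log index known to be replicated there
}
```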
AppendEntriesRPC
- Receiver implementation
- Reply false if term < currentTerm
- Reply false if log doesn't contain an entry at prevLogIndex whose term matches prevLogTerm
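A receiver-side sketch of those rules in Go, using the Raft and LogEntry types above, plus the remaining Figure 2 steps (truncate any conflicting suffix, append the new entries, advance commitIndex). Locking and persistence are omitted.

```go
package raft

type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderId     int        // so followers can redirect clients
	PrevLogIndex int        // index of the log entry immediately preceding the new ones
	PrevLogTerm  int        // term of the entry at PrevLogIndex
	Entries      []LogEntry // log entries to store (empty for heartbeat)
	LeaderCommit int        // leader's commitIndex
}

type AppendEntriesReply struct {
	Term    int  // currentTerm, for the leader to update itself
	Success bool // true if the follower had the entry matching PrevLogIndex/PrevLogTerm
}

// AppendEntries: reject stale terms, run the consistency check, then
// overwrite any conflicting suffix and advance commitIndex.
func (rf *Raft) AppendEntries(args *AppendEntriesArgs, reply *AppendEntriesReply) {
	reply.Term = rf.currentTerm

	// Reply false if the leader's term is stale.
	if args.Term < rf.currentTerm {
		reply.Success = false
		return
	}

	// Reply false if our log has no entry at PrevLogIndex whose term
	// matches PrevLogTerm (the consistency check).
	if args.PrevLogIndex >= len(rf.log) || rf.log[args.PrevLogIndex].Term != args.PrevLogTerm {
		reply.Success = false
		return
	}

	// Delete any conflicting suffix and append the leader's entries.
	rf.log = append(rf.log[:args.PrevLogIndex+1], args.Entries...)

	// Advance commitIndex to min(LeaderCommit, index of last new entry).
	if args.LeaderCommit > rf.commitIndex {
		last := len(rf.log) - 1
		if args.LeaderCommit < last {
			rf.commitIndex = args.LeaderCommit
		} else {
			rf.commitIndex = last
		}
	}
	reply.Success = true
}
```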
RequestVoteRPC
- Arguments
- term: candidate’s term
- candidateId: candidate requesting vote
- lastLogIndex: index of candidate’s last log entry
- lastLogTerm: term of candidate’s last log entry
- Results
- term: currentTerm, for candidate to update itself
- voteGranted: true means candidate received vote
- Receiver implementation
- Reply false if term < currentTerm
- If votedFor is null or candidateId, and candidate's log is at least as up-to-date as receiver's log, grant vote
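A receiver-side sketch in Go, on the Raft struct above (locking and persistence omitted). The candidateLogIsUpToDate helper is sketched later under Safety.

```go
package raft

type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateId  int // candidate requesting vote
	LastLogIndex int // index of candidate's last log entry
	LastLogTerm  int // term of candidate's last log entry
}

type RequestVoteReply struct {
	Term        int  // currentTerm, for the candidate to update itself
	VoteGranted bool // true means the candidate received the vote
}

// RequestVote grants the vote only if the candidate's term is current, we
// haven't already voted for someone else this term, and the candidate's log
// is at least as up-to-date as ours.
func (rf *Raft) RequestVote(args *RequestVoteArgs, reply *RequestVoteReply) {
	// A larger term makes us adopt it and clears any vote from the old term.
	if args.Term > rf.currentTerm {
		rf.currentTerm = args.Term
		rf.votedFor = -1
	}
	reply.Term = rf.currentTerm

	// Reply false if the candidate's term is stale.
	if args.Term < rf.currentTerm {
		reply.VoteGranted = false
		return
	}

	if (rf.votedFor == -1 || rf.votedFor == args.CandidateId) &&
		rf.candidateLogIsUpToDate(args.LastLogIndex, args.LastLogTerm) {
		rf.votedFor = args.CandidateId
		reply.VoteGranted = true
		return
	}
	reply.VoteGranted = false
}
```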
Raft Basics
Leaders, followers, and candidates
- Each server can be in one of three states: leader, follower, or candidate
- In normal operation there is exactly one leader and all the other servers are followers
- Followers are passive: they issue no requests on their own but simply respond to requests from leaders and candidates
- The candidate state is used to elect a new leader
Terms and elections
- Raft divides time into terms of arbitrary length. Terms are numbered with consecutive integers.
- Each term begins with an election, in which one or more candidates attempt to become leader. If a candidate wins the election, then it serves as leader for the rest of the term. If the election results in a split vote, the term ends with no leader, and a new term begins shortly afterward. Raft ensures that there is at most one leader in a given term.
Terms as a logical clock
- Different servers may observe the transitions between terms at different times, and in some situations a server may not observe an election or even entire terms.
- Terms act as a logical clock in Raft, and they allow servers to detect obsolete information.
- Each server stores a current term number. Current terms are exchanged whenever servers communicate.
- If one server’s current term is smaller than the other’s, then it updates its current term to the larger value.
- If a candidate or leader discovers that its term is out of date, it immediately reverts to follower state.
- If a server receives a request with a stale term number, it rejects the request.
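A small Go sketch of that bookkeeping, run whenever a term arrives in an RPC request or reply; the stale-term rejection itself happens in the individual handlers, as in the AppendEntries and RequestVote sketches above.

```go
package raft

// observeTerm applies the logical-clock rule: if a peer's term is larger
// than ours, adopt it, clear our vote for the new term, and report that a
// candidate or leader caller must revert to follower state.
func (rf *Raft) observeTerm(peerTerm int) (stepDown bool) {
	if peerTerm > rf.currentTerm {
		rf.currentTerm = peerTerm // larger term wins
		rf.votedFor = -1          // no vote cast yet in the new term
		return true               // caller reverts to follower state
	}
	return false
}
```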
RPC
- Raft servers communicate using two types of RPCs.
- RequestVote RPCs are initiated by candidates during elections.
- AppendEntries RPCs are initiated by leaders to replicate log entries and provide a form of heartbeat.
Leader Election
Heartbeat mechanism
- Raft uses a heartbeat mechanism to trigger leader election
- When servers start up, they begin as followers. Leaders send periodic heartbeats (AppendEntries RPCs that carry no log entries) to all followers in order to maintain their authority. A server remains in follower state as long as it receives valid RPCs from a leader or candidate.
- If a follower receives no communication over a period of time called the election timeout, it assumes there is no viable leader and begins an election to choose a new leader.
Election
- To begin an election, a follower increments its current term and transitions to candidate state. It then votes for itself and issues RequestVote RPCs in parallel to each of the other servers in the cluster (see the sketch after this list).
- Each server will vote for at most one candidate in a given term, on a first-come-first-served basis. The majority rule ensures that at most one candidate can win the election for a particular term
- A candidate continues in this state until one of three things happens: (a) it wins the election, (b) another server establishes itself as leader, or (c) a period of time goes by with no winner
- A candidate wins: A candidate wins an election if it receives votes from a majority of the servers in the full cluster for the same term. Once a candidate wins an election, it becomes leader. It then sends heartbeat messages to all of the other servers to establish its authority and prevent new elections.
- Another candidate wins: While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be leader. If the leader’s term (included in its RPC) is at least as large as the candidate’s current term, then the candidate recognizes the leader as legitimate and returns to follower state
- No winner: if many followers become candidates at the same time, votes could be split so that no candidate obtains a majority. When this happens, each candidate will time out and start a new election by incrementing its term and initiating another round of RequestVote RPCs.
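A candidate-side Go sketch of this flow. rf.me, rf.peerCount(), rf.lastLogIndex()/rf.lastLogTerm(), rf.sendRequestVote, and rf.becomeLeader are assumed helpers not defined in the state sketch, and locking around the shared Raft fields is omitted.

```go
package raft

import "sync"

// startElection: increment the term, vote for ourselves, and send
// RequestVote RPCs to every other server in parallel, counting votes until
// a majority is reached or the attempt is abandoned.
func (rf *Raft) startElection() {
	rf.currentTerm++    // new term for this candidacy
	rf.votedFor = rf.me // vote for ourselves
	term := rf.currentTerm

	var mu sync.Mutex
	votes := 1 // our own vote
	won := false

	for peer := 0; peer < rf.peerCount(); peer++ {
		if peer == rf.me {
			continue
		}
		go func(peer int) {
			args := &RequestVoteArgs{
				Term:         term,
				CandidateId:  rf.me,
				LastLogIndex: rf.lastLogIndex(),
				LastLogTerm:  rf.lastLogTerm(),
			}
			var reply RequestVoteReply
			if !rf.sendRequestVote(peer, args, &reply) {
				return // RPC failed; the next election timeout will retry
			}
			if rf.observeTerm(reply.Term) {
				return // saw a larger term: revert to follower, abandon election
			}
			mu.Lock()
			defer mu.Unlock()
			if reply.VoteGranted && rf.currentTerm == term {
				votes++
				if !won && votes > rf.peerCount()/2 {
					won = true
					rf.becomeLeader() // immediately send heartbeats to assert authority
				}
			}
		}(peer)
	}
}
```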
Avoiding split votes
- Raft uses randomized election timeouts to ensure that split votes are rare and that they are resolved quickly.
- To prevent split votes in the first place, election timeouts are chosen randomly from a fixed interval. This spreads out the servers so that in most cases only a single server will time out; it wins the election and sends heartbeats before any other servers time out.
- To resolve split votes, each candidate restarts its randomized election timeout at the start of an election and waits for that timeout to elapse before starting the next election; this reduces the likelihood of another split vote in the new election (see the sketch below).
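A Go sketch of the randomized timeout and the loop that uses it. The 150-300 ms range is the paper's suggested example, and rf.state, rf.lastHeard, and the state constants are assumed bookkeeping fields not shown in the state sketch.

```go
package raft

import (
	"math/rand"
	"time"
)

const (
	Follower = iota
	Candidate
	Leader
)

// electionTimeout picks a fresh randomized timeout each time it is called,
// so that in most cases only one server times out and wins before the others.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

// ticker re-randomizes the timeout on every iteration; if no heartbeat or
// vote grant has arrived within it, start (or restart) an election.
func (rf *Raft) ticker() {
	for {
		timeout := electionTimeout()
		time.Sleep(timeout)
		if rf.state != Leader && time.Since(rf.lastHeard) >= timeout {
			rf.startElection()
		}
	}
}
```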
Log Replication
Log Replication Process
- The leader appends the command to its log as a new entry
- The leader issues AppendEntries RPCs in parallel to each of the other servers to replicate the entry
- When the entry has been safely replicated (the entry is replicated on a majority of the servers), the leader applies the entry to its state machine and returns the result of that execution to the client.
- If followers crash or run slowly, or if network packets are lost, the leader retries AppendEntries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries (a leader-side sketch follows this list)
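A leader-side Go sketch of this sequence. rf.state, rf.me, rf.peerCount(), and rf.replicateTo (the per-follower retry loop, sketched under Resolving inconsistencies below) are assumptions; the entry is applied and the client answered only after commitIndex advances past it.

```go
package raft

// Start handles a client command on the leader: append the entry to the
// leader's own log, then replicate it to each follower in parallel.
func (rf *Raft) Start(command interface{}) (index, term int, isLeader bool) {
	if rf.state != Leader {
		return -1, rf.currentTerm, false // only the leader accepts new entries
	}
	rf.log = append(rf.log, LogEntry{Term: rf.currentTerm, Command: command})
	index = len(rf.log) - 1

	for peer := 0; peer < rf.peerCount(); peer++ {
		if peer != rf.me {
			go rf.replicateTo(peer) // retries until the follower stores the entry
		}
	}
	return index, rf.currentTerm, true
}
```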
Log structure
- Each log entry stores a state machine command along with the term number when the entry was received by the leader. The term numbers in log entries are used to detect inconsistencies between logs and to ensure some of the properties described below.
- Each log entry also has an integer index identifying its position in the log
Committed
- A log entry is committed once the leader that created the entry has replicated it on a majority of the servers. This also commits all preceding entries in the leader's log, including entries created by previous leaders.
- The leader keeps track of the highest index it knows to be committed, and it includes that index in future AppendEntries RPCs (including heartbeats) so that the other servers eventually find out. Once a follower learns that a log entry is committed, it applies the entry to its local state machine (in log order).
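A Go sketch of the apply side. The applyCh channel and rf.waitForCommitSignal are assumptions about how the surrounding service consumes committed entries; real implementations also need locking here.

```go
package raft

// applier applies committed but not-yet-applied entries, in log order, by
// sending their commands to the service's apply channel.
func (rf *Raft) applier(applyCh chan<- interface{}) {
	for {
		for rf.lastApplied < rf.commitIndex {
			rf.lastApplied++
			applyCh <- rf.log[rf.lastApplied].Command // apply in log order
		}
		rf.waitForCommitSignal() // assumed: block until commitIndex changes
	}
}
```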
Log Matching property
- If two entries in different logs have the same index and term, then they store the same command. This property follows from the fact that a leader creates at most one entry with a given log index in a given term, and log entries never change their position in the log.
- If two entries in different logs have the same index and term, then the logs are identical in all preceding entries. This property is guaranteed by a simple consistency check performed by AppendEntries. When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry in its log with the same index and term, then it refuses the new entries. This consistency check acts as an induction step.
Resolving inconsistencies
Leader crashes can leave the logs inconsistent: a follower may be missing entries, or it may have extra uncommitted entries.
- In Raft, the leader handles inconsistencies by forcing the follower's log to duplicate its own. It finds the latest log entry where the two logs agree, deletes any entries in the follower's log after that point, and sends the follower all of the leader's entries after that point.
- These actions happen in response to the consistency check performed by AppendEntries RPCs. The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower. When a leader first comes to power, it initializes all nextIndex values to the index just after the last one in its log. If a follower's log is inconsistent with the leader's, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex reaches a point where the leader and follower logs match. When this happens, AppendEntries succeeds, which removes any conflicting entries in the follower's log and appends entries from the leader's log (if any). Once AppendEntries succeeds, the follower's log is consistent with the leader's, and it will remain that way for the rest of the term (see the sketch below).
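A leader-side Go sketch of this loop, continuing the Raft sketches above. rf.state, rf.me, and rf.sendAppendEntries are assumptions, and locking plus the commitIndex advancement that uses matchIndex are omitted.

```go
package raft

// replicateTo sends entries starting at nextIndex[peer]; on rejection it
// decrements nextIndex and retries until the logs match, then appends from
// the match point onward.
func (rf *Raft) replicateTo(peer int) {
	for rf.state == Leader {
		prev := rf.nextIndex[peer] - 1
		args := &AppendEntriesArgs{
			Term:         rf.currentTerm,
			LeaderId:     rf.me,
			PrevLogIndex: prev,
			PrevLogTerm:  rf.log[prev].Term,
			Entries:      rf.log[prev+1:],
			LeaderCommit: rf.commitIndex,
		}
		var reply AppendEntriesReply
		if !rf.sendAppendEntries(peer, args, &reply) {
			continue // lost packet or slow follower: retry indefinitely
		}
		if rf.observeTerm(reply.Term) {
			return // a newer term exists; we are no longer leader
		}
		if reply.Success {
			rf.nextIndex[peer] = len(rf.log)      // follower is caught up
			rf.matchIndex[peer] = len(rf.log) - 1 // used to advance commitIndex
			return
		}
		rf.nextIndex[peer]-- // consistency check failed: back up one entry and retry
	}
}
```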
Safety
Election
- In any leader-based consensus algorithm, the leader must eventually store all of the committed log entries. In some consensus algorithms, such as Viewstamped Replication, a leader can be elected even if it doesn't initially contain all of the committed entries, and the missing entries are transmitted to the new leader.
- Raft uses a simpler approach: it prevents a candidate from winning an election unless its log contains all committed entries, which ensures that each new leader has all the committed entries from previous terms. This is done by including information about the candidate's log in the RequestVote RPC: the voter denies its vote if its own log is more up-to-date than that of the candidate.
- Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.
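The comparison as a small Go helper on the Raft struct above; this is the check used by the RequestVote sketch earlier.

```go
package raft

// candidateLogIsUpToDate implements the election restriction: the
// candidate's last entry must have a later term than ours, or the same term
// and an index at least as large.
func (rf *Raft) candidateLogIsUpToDate(lastLogIndex, lastLogTerm int) bool {
	myLastIndex := len(rf.log) - 1
	myLastTerm := rf.log[myLastIndex].Term
	if lastLogTerm != myLastTerm {
		return lastLogTerm > myLastTerm // later term wins
	}
	return lastLogIndex >= myLastIndex // same term: longer log wins
}
```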
Committing entries from previous terms
Follower and candidate crashes
Timing and availability
Zookeeper
Zookeeper Service
Znode
- Zookeeper provides to its clients the abstraction of a set of data nodes (znodes), organized according to a hierarchical name space
- To refer to a given znode, we use the standard UNIX notation for file system path, like /A/B/C
- There are two kinds of znodes a client can create: regular and ephemeral. A regular znode is created and deleted by clients explicitly. An ephemeral znode is created by clients and can be deleted by clients explicitly, but it is also removed automatically when the session that creates it terminates.
Watches
- ZooKeeper implements watches to allow clients to receive timely notifications of changes without requiring polling.
- When a client issues a read operation with a watch flag set, the operation completes as normal except that the server promises to notify the client when the information returned has changed.
- Watches are one-time triggers associated with a session; they are unregistered once triggered or the session closes.
Sessions
- A client connects to ZooKeeper and initiates a session.
- Sessions have an associated timeout. ZooKeeper considers a client faulty if it does not receive anything from its session for more than that timeout.
- A session ends when the client explicitly closes the session handle or ZooKeeper detects that the client is faulty.
Client API
- create(path, data, flags): creates a regular or ephemeral znode with the given path and stores data in it
- exists(path, watch): tells whether a znode with the given path exists; setting the watch flag to true sets a watch on the znode
- getData(path, watch): gets the data and meta-data (such as version information) associated with the znode
- getChildren(path, watch): gets the set of names of the children of a znode
- setData(path, data, version): writes data to the znode with the given path if the znode is at the expected version; otherwise the call fails with an unexpected version error
- delete(path, version): deletes the znode with the given path, again only if it is at the expected version
- sync(path): waits for all updates pending at the start of the operation to propagate to the server that the client is connected to; the path argument is currently ignored
- All methods have both a synchronous and an asynchronous version available through the API (a small usage sketch follows this list)
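A sketch of this API as a hypothetical Go interface (the real ZooKeeper client libraries are Java and C, with different names and signatures), plus a small helper showing how the version argument turns setData into a conditional update.

```go
package zkexample

// Client mirrors the operations above as a hypothetical Go interface; it is
// not the API of any real ZooKeeper client library.
type Client interface {
	Create(path string, data []byte, ephemeral bool) error
	Exists(path string, watch bool) (bool, error)
	GetData(path string, watch bool) (data []byte, version int, err error)
	GetChildren(path string, watch bool) ([]string, error)
	SetData(path string, data []byte, version int) error
	Delete(path string, version int) error
	Sync(path string) error
}

// updateZnode reads the znode, transforms its data, and writes it back only
// if nobody else changed the znode in between; otherwise SetData fails with
// a version error and the caller can re-read and retry.
func updateZnode(c Client, path string, transform func([]byte) []byte) error {
	data, version, err := c.GetData(path, false)
	if err != nil {
		return err
	}
	return c.SetData(path, transform(data), version)
}
```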
Zookeeper Guarantees
Linearizable writes
- All requests that update the state of ZooKeeper are serializable and respect precedence
Example: New leader changes configuration
- Problem: what if a process sees that ready exists before the new leader starts to make a change, and then starts reading the configuration while the change is in progress? Solution: use the ordering guarantee for notifications: a client watching ready receives the notification of the change to ready before it can see any of the new configuration, so it knows to discard what it has read and retry (see the sketch below).
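A Go sketch of the ready-znode pattern using the hypothetical Client interface above. The /config paths, using version -1 to mean "any version", and the retry policy are assumptions of the sketch.

```go
package zkexample

// publishConfig is what the new leader does: delete ready, update the
// individual configuration znodes, then recreate ready so readers know the
// new configuration is complete.
func publishConfig(c Client, cfg map[string][]byte) error {
	if err := c.Delete("/config/ready", -1); err != nil { // -1: any version (sketch convention)
		return err
	}
	for name, data := range cfg {
		if err := c.SetData("/config/"+name, data, -1); err != nil {
			return err
		}
	}
	return c.Create("/config/ready", nil, false)
}

// readConfig is what a process does: watch ready first, then read the
// configuration znodes. If the leader deletes ready while we read, the watch
// notification arrives before any new value can be observed, so the caller
// knows to discard this result and retry.
func readConfig(c Client, names []string) (map[string][]byte, bool, error) {
	ok, err := c.Exists("/config/ready", true)
	if err != nil || !ok {
		return nil, false, err
	}
	cfg := make(map[string][]byte)
	for _, name := range names {
		data, _, err := c.GetData("/config/"+name, false)
		if err != nil {
			return nil, false, err
		}
		cfg[name] = data
	}
	return cfg, true, nil
}
```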
Zookeeper Implementation
Replication
- ZooKeeper provides high availability by replicating the ZooKeeper data on each server that composes the service.
- Read requests are serviced from the local replica of each server's database.
- Write requests are processed by an agreement protocol (an implementation of atomic broadcast).
- Write requests are forwarded to the leader. The followers receive message proposals consisting of state changes from the leader and agree upon state changes.
Data storage
- The replicated database is an in-memory database containing the entire data tree.
- For recoverability, Zookeeper keeps a replay log (a write-ahead log) of committed operations and forces writes to the disk media before they are applied to the in-memory database. It also generates periodic snapshots of the in-memory database.