Partition
Partitioning Key
Partitioning by key range
How
- Assign each partition a continuous range of keys
- If you know the boundaries between the ranges, then you can determine which partition contains a given key
- The partition boundaries can be chosen manually by an administrator or automatically by the database (see the lookup sketch below)
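A minimal sketch of the boundary lookup: a binary search over the sorted boundary keys. The split keys "g" and "o" and the helper name are hypothetical, purely for illustration.

```python
import bisect

# Hypothetical boundaries: partition 0 holds keys < "g",
# partition 1 holds "g" up to (but not including) "o",
# partition 2 holds keys >= "o".
BOUNDARIES = ["g", "o"]

def partition_for_key(key: str) -> int:
    """Binary-search the sorted boundaries to find the partition containing `key`."""
    return bisect.bisect_right(BOUNDARIES, key)

assert partition_for_key("apple") == 0
assert partition_for_key("grape") == 1
assert partition_for_key("zebra") == 2
```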
Pros and Cons
- Pros: Range queries on the key are efficient, since each partition keeps its keys in sorted order
- Cons: Certain access patterns may lead to hot spots. For example, if the key is a timestamp, then all writes might end up going to the same partition (the one for the current period).
Applications
- BigTable and its open source equivalent HBase
- RethinkDB
- MongoDB before 2.4
Partitioning by hash
How
- Use a hash function to determine the partition for a given key.
- For partitioning purposes, the hash function need not be cryptographically strong.
- Some programming languages' built-in hash functions may not be suitable for partitioning: for example, in Java's Object.hashCode() and Ruby's Object#hash, the same key may have a different hash value in different processes (see the sketch below for a process-stable alternative)
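A minimal sketch of hash partitioning using MD5, which gives the same value for a given key in every process; the partition count of 8 is an assumed configuration value.

```python
import hashlib

NUM_PARTITIONS = 8  # assumed number of logical partitions

def stable_hash(key: str) -> int:
    # Unlike Python's built-in hash(), which is randomized per process,
    # MD5 hashes the same key to the same value everywhere.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")

def partition_for_key(key: str) -> int:
    return stable_hash(key) % NUM_PARTITIONS

print(partition_for_key("user123"))  # same result on every node and every run
```

Note that the modulo is taken over a fixed number of logical partitions, not over the number of nodes; see the "hash mod N" caveat under Rebalancing Partitions below.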
Pros and Cons
- (Pro) Distributes keys fairly among the partitions
- (Con) Inefficient for range queries on the partitioning key, since adjacent keys are scattered across the partitions
Examples
- Hash functions
- MD5: used by Cassandra and MongoDB
- Fowler-Noll-Vo: used by Voldemort
- Range queries
- Not supported on the primary key: Riak, Couchbase, and Voldemort
- Sent to all partitions: MongoDB
Partitioning and Secondary Indexes
Partitioning Secondary Indexes by Document
How
- Partition the database by the ID that uniquely identifies a record
- Each partition maintains its own secondary indexes, covering only the documents in that partition
Pros and Cons
- Cons: Reads need to query all partitions and combine the results (scatter/gather; see the sketch below)
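A hypothetical sketch of a document-partitioned (local) secondary index: a write stays on one partition, but a secondary-index read must scatter to every partition and gather the results. Names and the "color" field are illustrative only.

```python
from collections import defaultdict

class Partition:
    def __init__(self):
        self.docs = {}                    # doc_id -> document
        self.index = defaultdict(list)    # local secondary index: color -> doc_ids in this partition

    def put(self, doc_id, doc):
        self.docs[doc_id] = doc
        self.index[doc["color"]].append(doc_id)

partitions = [Partition() for _ in range(4)]

def put(doc_id: int, doc: dict):
    # Primary partitioning is by document ID, so a write touches exactly one partition.
    partitions[doc_id % len(partitions)].put(doc_id, doc)

def find_by_color(color: str):
    # Each index only covers its own partition's documents,
    # so the read must query all partitions (scatter/gather).
    return [doc_id for p in partitions for doc_id in p.index.get(color, [])]

put(191, {"color": "red"})
put(214, {"color": "red"})
print(find_by_color("red"))  # matching IDs come from multiple partitions
```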
Applications
- MongoDB, Riak, Cassandra, Elasticsearch, SolrCloud, and VoltDB
Partitioning Secondary Indexes by Term
How
- Maintain a global secondary index that covers data in all partitions
- Partition the global secondary index by the indexed term itself, rather than by the document ID used for the primary index
Pros and Cons
- Pros: Reads are more efficient because we only need to query the partitions containing the terms we want (see the sketch below)
- Cons: Writes are slower and more complicated because a write to a single document may now affect multiple partitions of the index
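A hypothetical sketch of a term-partitioned (global) secondary index: a read touches only the index partition that owns the term, but a single document write may update several index partitions. The partition counts and the "tags" field are assumptions for illustration.

```python
import hashlib
from collections import defaultdict

NUM_DOC_PARTITIONS = 4
NUM_INDEX_PARTITIONS = 4

def stable_hash(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest(), "big")

doc_partitions = [dict() for _ in range(NUM_DOC_PARTITIONS)]
# Global index partitioned by term: all postings for a given term live in one index partition.
index_partitions = [defaultdict(list) for _ in range(NUM_INDEX_PARTITIONS)]

def put(doc_id: int, doc: dict):
    doc_partitions[doc_id % NUM_DOC_PARTITIONS][doc_id] = doc
    # One document write may touch several index partitions,
    # one per distinct term in the document (slower, more complicated writes).
    for term in doc["tags"]:
        index_partitions[stable_hash(term) % NUM_INDEX_PARTITIONS][term].append(doc_id)

def find_by_tag(term: str):
    # A read only queries the single index partition that owns this term.
    return index_partitions[stable_hash(term) % NUM_INDEX_PARTITIONS].get(term, [])

put(191, {"tags": ["red", "convertible"]})
print(find_by_tag("red"))  # [191]
```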
Other issues
- Updates to global secondary indexes are often asynchronous
Rebalancing Partitions
Requirements
- While rebalancing, the data moved between nodes should be minimized
- While rebalancing, the database should continue accepting reads and writes
- After rebalancing, the load (data storage, read and write requests) should be shared fairly between the nodes in the cluster
Rebalancing strategies
(Don’t) hash mod N
- When N (the number of nodes) changes, most of the keys will need to be moved from one node to another (see the check below)
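A quick back-of-the-envelope check of this problem, assuming MD5 hashing and a cluster growing from 10 to 11 nodes: roughly 10/11 of the keys change nodes.

```python
import hashlib

def stable_hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")

keys = [f"user{i}" for i in range(10_000)]

# Count how many keys map to a different node when N goes from 10 to 11.
moved = sum(1 for k in keys if stable_hash(k) % 10 != stable_hash(k) % 11)
print(f"{moved / len(keys):.0%} of keys would have to move")  # typically around 90%
```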
Fixed number of partitions
- Create many more partitions than there are nodes
- Assign keys to partitions. This assignment will not change.
- Assign several partitions to each node. This assignment will change when rebalancing happens.
- When a new node is added, it can steal a few partitions from every existing node until partitions are fairly distributed once again
- When a node is removed from the cluster, its partitions are distributed fairly among the remaining nodes (a small sketch of adding a node follows)
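A rough sketch of this scheme with an assumed 256 partitions over 4 nodes: the key-to-partition mapping never changes, and adding a node moves only whole partitions.

```python
NUM_PARTITIONS = 256  # assumed, fixed for the lifetime of the cluster

def assign(num_partitions, nodes):
    """Round-robin partitions over nodes; returns node -> set of partition ids."""
    assignment = {n: set() for n in nodes}
    for p in range(num_partitions):
        assignment[nodes[p % len(nodes)]].add(p)
    return assignment

def add_node(assignment, new_node):
    """The new node steals a few partitions from every existing node until the load is even."""
    assignment[new_node] = set()
    target = NUM_PARTITIONS // len(assignment)
    for node, parts in assignment.items():
        while node != new_node and len(parts) > target and len(assignment[new_node]) < target:
            assignment[new_node].add(parts.pop())
    return assignment

cluster = assign(NUM_PARTITIONS, ["node0", "node1", "node2", "node3"])
cluster = add_node(cluster, "node4")
print({n: len(ps) for n, ps in cluster.items()})  # roughly 51 partitions per node
```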
Dynamic Partitioning
- When a partition grows beyond a configured size it is split into two, and when it shrinks below a threshold it can be merged with an adjacent partition, so the number of partitions adapts to the data volume
Partitioning proportionally to nodes
- Keep a fixed number of partitions per node, so the total number of partitions grows with the cluster; a new node splits a few existing partitions and takes over part of them
Examples
- Fixed number of partitions: Riak, Elasticsearch, Couchbase, Voldemort
Automatic or Manual Rebalancing
(Don’t do) Fully automated
- Most convenient, but could be problematic since the behaviour can be unpredictable and rebalancing is expensive
- It can be a good thing to have a human in the loop for rebalancing. It’s slower than a fully automatic process, but it can help prevent operational surprises.
Semi-automated
- The system generates a suggested partition assignment automatically, but requires an administrator to commit it before it takes effect
Fully manual
Examples
- Semi-automated: Couchbase, Riak, and Voldemort
Request Routing
What component makes the routing decision
Any node
- Send the request to any node; if that node does not own the required partition, it forwards the request to the node that should handle it
Routing tier
- Send all requests to a routing tier, which determines the node responsible for each request and forwards it accordingly
Client
- The client is aware of the partition assignment and sends requests directly to the appropriate node
How to make routing decision
- Many distributed data systems rely on a separate coordination service such as ZooKeeper to maintain the authoritative mapping of partitions to nodes.
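A minimal sketch of the routing decision, with an in-memory dict standing in for the partition-to-node mapping that a coordination service such as ZooKeeper would hold authoritatively; the node names and partition count are assumptions.

```python
import hashlib

NUM_PARTITIONS = 8

# Authoritative mapping of partitions to nodes; in a real system this would be
# read from (and kept up to date via) the coordination service.
partition_to_node = {p: f"node{p % 3}" for p in range(NUM_PARTITIONS)}

def route(key: str) -> str:
    """Return the node that should handle requests for `key`."""
    partition = int.from_bytes(hashlib.md5(key.encode()).digest(), "big") % NUM_PARTITIONS
    return partition_to_node[partition]

print(route("user123"))  # the routing tier (or any node) forwards the request here
```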
Real-world examples
- HBase, SolrCloud, Kafka: use ZooKeeper to track partition assignment.
- LinkedIn’s Espresso: uses Helix for cluster management (which in turn relies on ZooKeeper)
- MongoDB: relies on its own config server implementation and mongos daemons as the routing tier.
- Cassandra, Riak: use a gossip protocol among the nodes to disseminate any changes in cluster state. Requests can be sent to any node, and that node forwards them to the appropriate node for the requested partition
- Couchbase: does not rebalance automatically, and is normally configured with a routing tier called moxi, which learns about routing changes from the cluster nodes
Real-World examples
MySQL