An intro to sharding (horizontal scalability) & replica sets (durability)

Recommended Reading

Seven Databases in Seven Weeks (pp 165-173)
Scaling MongoDB (pp All!)
MongoDB: The Definitive Guide (Chapters 9-10)
MongoDB In Action (Chapters 8-9)

Intro to Sharding & Replication

What makes document DBs unique is their ability to efficiently handle arbitrarily nested, schemaless ‘documents’. What makes MongoDB special among document DBs is its ability to scale across several servers, by replicating (copying data to other servers) or sharding collections (splitting a collection into pieces), and performing queries in parallel. Both promote availability.

Sharding has a cost: if one part of the collection is lost, the whole collection is compromised. What good is querying a database of world country facts if the southern hemisphere is down? Mongo deals with this inherent weakness of sharding via duplication.

Replication improves durability, but consequently generates issues not found in single-source databases. One problem is deciding which node gets promoted when a master node goes down. Mongo deals with this by giving each mongod service a vote, and the one with the freshest data is elected the new master.

The problem with even nodes
The concept of replication is simple enough. You write to one MongoDB server and that data is duplicated to others within the replica set. If one server is unavailable, then one of the others can be promoted and handle requests.
– MongoDB demands an odd number of total nodes in the replica set.

To see why an odd number of nodes is best, consider what might happen with a 4-node replica set. Let’s imagine two of the servers lose connectivity to the other two. One set will have the original master, but since it can’t see a clear majority of the network (2 vs 2) the master demotes. The other set also won’t be able to elect a master as it too can’t communicate with a clear majority of nodes. Both sets become unable to process requests and the system is effectively down.
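The arithmetic behind this can be sketched in a couple of lines of shell: electing a primary requires a strict majority of the voting members, floor(n/2) + 1, so a 2-vs-2 partition of a 4-node set leaves both halves short.

```shell
# Majority needed to elect a primary is floor(n/2) + 1 voting members.
for n in 3 4 5; do
  echo "$n nodes: majority = $((n / 2 + 1))"
done
# With 4 nodes, a 2-2 partition leaves each side with only 2 votes,
# short of the required 3, so neither side can elect a primary.
```

Note that 4 nodes need the same majority (3) as 5 nodes, so the fourth node adds no extra partition tolerance.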


CouchDB, on the other hand, allows multiple masters. Unlike Riak, MongoDB always knows which data is most recent. Mongo’s concern is strong consistency on writes, and a) preventing a multi-master scenario and b) understanding the primacy of data spread across nodes are both strategies to achieve this.

SHARDING facilitates handling very large datasets and extensibility, aka horizontal scaling.

Rather than having a single server holding all the data, sharding splits the data onto other servers – I tend to think about database statistics histograms within RDBMS engines here. Perhaps the first 3 chunks of data on a histogram might go on shard 1, the 4th chunk on shard 2, and chunks 5, 6 & 7 on shard 3.

A shard is one or more servers in a cluster that are responsible for some subset of the data. If we had a cluster containing 1,000,000 documents representing users of a website, one shard might contain information about 200,000 of those users.

To evenly spread data across shards, MongoDB moves subsets of the data from shard to shard. It figures out which subset(s) to move where based on a key that you choose. For example, we might choose to split up a collection of users based on the username field. MongoDB uses range-based splitting: that is, data is split into chunks covering given ranges, eg [“a”, “d”)
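As a sketch of how you tell MongoDB which key to split on (the database and collection names here are hypothetical, and the commands are issued against a mongos router’s admin database, not a shard directly):

```shell
# Connect to the mongos router and enable sharding for a database,
# then shard a collection on a chosen key
$ mongo localhost:27017/admin
> db.runCommand({ enablesharding : "mydb" })
> db.runCommand({ shardcollection : "mydb.users", key : { username : 1 } })
```

MongoDB then splits mydb.users into chunks of username ranges, e.g. [“a”, “d”), and balances those chunks across the shards.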

A shard can consist of many servers. If there is more than one server in a shard, each server has an identical copy of the subset of data. In production, a shard will usually be a replica set.

Sharding requires some kind of directory/lookup. Imagine we created a table to store city names alphabetically. We need some way to know that, eg, cities starting A-D go to server mongo2, P-Z to mongo4, etc. In MongoDB you create a config server to hold this mapping. MongoDB makes this easier by autosharding, managing the spread of data for you.
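A minimal sketch of that lookup layer, assuming everything runs on one machine (the ports and paths are my own choices): a config server holds the range-to-shard mapping, and a mongos router consults it to route queries.

```shell
# Start a config server to hold the chunk-to-shard mapping
$ mkdir -p ~/dbs/config
$ mongod --configsvr --dbpath ~/dbs/config --port 20000

# Start a mongos router that consults the config server;
# clients connect to mongos as if it were an ordinary mongod
$ mongos --configdb localhost:20000 --port 27017

# Register each shard with the cluster via the mongos
$ mongo localhost:27017/admin
> db.runCommand({ addshard : "localhost:10000" })
```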


Distribution, Replica sets, Master-Slave

I’m just getting to grips with distributed database terminology (pp 130 in “MongoDB: The Definitive Guide”)

Master-Slave Replication

Master-slave replication can be used for:

  • Backup
  • Failover
  • Read scaling, and more

Read scaling could be really interesting from a BI consumption point-of-view eg a peak of user traffic to the DW portal at the beginning of the day.
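For read scaling, a client has to opt in to reading from a non-master node, since its data may lag slightly behind. A sketch in the mongo shell (the port and collection name are just examples):

```shell
# Connect directly to a slave and opt in to (possibly stale) reads
$ mongo localhost:10001
> db.getMongo().setSlaveOk()   // allow queries on this non-master node
> db.users.find().count()      // reads served here, offloading the master
```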

In MongoDB the most basic setup is to start a master node (in my case the Dell 5100) and add one or more slave nodes (in my case, the Raspberry Pis). Each of the slaves must know the address of the master.

To start the master run: mongod --master

To start a slave run: mongod --slave --source master_address
– where master_address is the address of the master node just started. This is where my previous post on fixing a static IP comes in handy.

First, create a directory for the master to store data in and choose a port (10000)

$ mkdir -p ~/dbs/master
$ ./mongod --dbpath ~/dbs/master --port 10000 --master

Now, set up the slave(s) choosing a different data directory (if on same/virtual machine) and port. For any slave, you also need to specify who the master is

$ mkdir -p ~/dbs/slave
$ ./mongod --dbpath ~/dbs/slave --port 10001 --slave --source localhost:10000

All slaves must be replicated from a master node. It is not possible to replicate from slave to slave.
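Once both processes are up, a quick way to see replication working is to write a document via the master and read it back from the slave (the database and collection names here are just examples):

```shell
# Write a document through the master...
$ mongo localhost:10000/test
> db.cities.insert({ name : "Aberdeen" })

# ...then read it back from the slave (reads there need slaveOk)
$ mongo localhost:10001/test
> db.getMongo().setSlaveOk()
> db.cities.findOne({ name : "Aberdeen" })
```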

…As soon as my 32MB SD cards arrive in the post and I install MongoDB on the remaining 4 R Pis, I will give this a go!

Replica Sets

A replica set is the same as the above but with automatic failover. The biggest difference between a M-S cluster and a replica set is that a replica set does not have a fixed master.

One is ‘elected’ by the cluster and may change to another node if the current master becomes uncontactable.


There are 3 different types of nodes which can co-exist in a MongoDB cluster:

  1. Standard – Stores a complete, full copy of the data being replicated, takes part in the voting when the primary node is being elected and is capable of being the primary node in the cluster
  2. Passive – As above, but will never become the primary node for the set
  3. Arbiter – Participates only in voting. Does not receive any of the data being replicated and cannot become the primary node

Standard and passive nodes are configured using the priority key. A node with priority 0 is passive and will never be selected as primary.
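As a sketch of how those three roles map to configuration (the set name, hosts and ports are my own choices), assuming each member is started with --replSet:

```shell
# Each member is started with the same replica set name
$ mongod --dbpath ~/dbs/node1 --port 10001 --replSet myset

# Then, from the mongo shell on one member, initiate the set
$ mongo localhost:10001
> rs.initiate({
    _id : "myset",
    members : [
      { _id : 0, host : "localhost:10001" },                    // standard
      { _id : 1, host : "localhost:10002", priority : 0 },      // passive
      { _id : 2, host : "localhost:10003", arbiterOnly : true } // arbiter
    ]
  })
```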

I suppose in my case, if I had SD cards of decreasing sizes, or a mix of model A (256 MB) and model B (512 MB) Pis, then I could set priorities in decreasing order so the weakest node was never the master and was always selected last (lowest priority).

Initializing a set (pp 132)