MongoDB Sharding Guide: Overview
During the past few months, I’ve been diving into the MongoDB world and have really been enjoying it. It has been a nice breath of fresh air, and I agree that MongoDB is to databases what Rails was to frameworks. I’m not going to cover the basics of MongoDB in this guide on sharding, but will instead describe setting up a basic production cluster in the cloud. There are numerous ways to configure a sharded cluster, and I urge you to explore and comment on different configurations that have worked for you. In this guide I will explain my own configuration and how to get it all running.
Some might wonder: what are the benefits and use cases of a sharded setup? I’m not going to dive into the specifics that are outlined in the sharding introduction, but will instead discuss a few use cases that made me jump into a sharded configuration.
Needing more capacity
What happens when your indexes fill up all available memory or the data drive fills up? Well, I could obviously upgrade to a bigger instance and scale vertically. Instead, why not just throw another shard at it and let Mongo balance the data? Win!
What happens when you need to speed up your reporting aggregations? One could move to a complex Hadoop setup. Alternatively, why not throw in more shards? Mongo’s map/reduce implementation currently has a limitation of being single threaded. However, map/reduces will run in parallel across all shards if the collection you are aggregating against is sharded. With that parallel processing available, we can simply throw more shards in and let our aggregations fly. If you have complex aggregations or many of them, you can even go as far as having a large quantity of smaller shards and increase your level of concurrency. It really all depends on your specific needs.
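To make the parallelism argument a little more concrete, here is a toy model in plain Python. It is not MongoDB code: each "shard" is just a list of documents, the field names `page` and `hits` are made up for illustration, and threads stand in for separate shard servers. The point is only the shape of the work: every shard aggregates its own slice independently, and the partial results are merged at the end.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_reduce_on_shard(docs):
    """Aggregate one shard's documents only (the per-shard map + reduce)."""
    counts = Counter()
    for doc in docs:
        counts[doc["page"]] += doc["hits"]
    return counts


def sharded_map_reduce(shards):
    """Run one job per shard concurrently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(map_reduce_on_shard, shards))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


# Two hypothetical shards, each holding part of the collection.
shards = [
    [{"page": "/home", "hits": 3}, {"page": "/about", "hits": 1}],
    [{"page": "/home", "hits": 2}, {"page": "/pricing", "hits": 4}],
]
result = sharded_map_reduce(shards)
```

Adding a third list to `shards` adds a third concurrent job without touching the aggregation logic, which is the same reason throwing more shards at a sharded collection raises the level of concurrency.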
My basic server layout consists of the following:
- 2 shard servers (large EC2 instances)
- 3 configuration servers (micro EC2 instances)
- 1 routing/application server (micro EC2 instance)
Three configuration servers are recommended for production use, but you can get away with one if you are testing. I am keeping everything running on separate instances for organizational reasons, but because the configuration and routing processes carry a light load, you can also share resources and use a layout similar to the one shown in the MongoDB Sharding Introduction docs.
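As a rough sketch of how the layout above maps onto processes, the snippet below builds the start-up command each machine would run. The hostnames and the `/data/db` path are placeholders I made up; the ports are MongoDB's defaults for each role. The comma-separated `--configdb` list matches the three-standalone-config-server setup described in this post; newer MongoDB releases expect the config servers to run as a replica set, so treat the flags as illustrative rather than copy-paste ready.

```python
# Hypothetical hostnames for the six instances in the layout.
LAYOUT = {
    "shard": ["shard1.example.internal", "shard2.example.internal"],
    "config": [
        "config1.example.internal",
        "config2.example.internal",
        "config3.example.internal",
    ],
    "router": ["app1.example.internal"],
}

SHARD_PORT = 27018   # default port for mongod --shardsvr
CONFIG_PORT = 27019  # default port for mongod --configsvr


def startup_commands(layout, dbpath="/data/db"):
    """Return a {hostname: command line} mapping for every server."""
    config_list = ",".join(f"{host}:{CONFIG_PORT}" for host in layout["config"])
    commands = {}
    for host in layout["shard"]:
        commands[host] = f"mongod --shardsvr --port {SHARD_PORT} --dbpath {dbpath}"
    for host in layout["config"]:
        commands[host] = f"mongod --configsvr --port {CONFIG_PORT} --dbpath {dbpath}"
    for host in layout["router"]:
        # The router stores no data; it only needs to find the config servers.
        commands[host] = f"mongos --configdb {config_list}"
    return commands


commands = startup_commands(LAYOUT)
```

Sharing resources, as mentioned above, would simply mean listing the same hostname under more than one role and giving each process its own data directory.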
Our plan for setting up our cluster will involve the following steps:
Once these steps are complete, scaling our cluster will be as simple as building another shard server/replication set and adding it to our configuration database. To make your life even easier, I will be providing simple build scripts on GitHub to help automate the server build process. Once you add the new shard, Mongo takes care of the rest. When you are ready to get started, jump ahead and read the next post in this series: Server Setup on Ubuntu + EC2.
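For a taste of what "adding it to our configuration database" looks like, here is a minimal sketch that builds the `addShard` command document you would send to the routing server. The helper name `add_shard_command` and the hostnames are my own placeholders; only the `addShard` command itself and its `replicaSet/host:port` address form come from MongoDB.

```python
def add_shard_command(host, port=27018, replica_set=None):
    """Build the admin command document that registers a new shard.

    For a replica-set shard, MongoDB expects the address in the form
    "<replica set name>/<host>:<port>".
    """
    address = f"{host}:{port}"
    if replica_set:
        address = f"{replica_set}/{address}"
    return {"addShard": address}


# Hypothetical third shard, first as a single server, then as a replica set.
standalone = add_shard_command("shard3.example.internal")
replicated = add_shard_command("shard3.example.internal", replica_set="rs3")
```

With a driver such as pymongo connected to the mongos process, the document would be passed to something like `client.admin.command(...)`; the same thing can be done from the mongo shell with `sh.addShard("shard3.example.internal:27018")`.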