Adding a shard to your MongoDB cluster? Here’s what you need to know

Antonios Giannopoulos

7 years ago

“Let’s add another shard to your MongoDB cluster…”

If you’ve been an ObjectRocket customer for a while, you’ve probably heard our engineers say “we should add another shard to your cluster”. Or, maybe you’ve used our RocketScale™ mechanism that automatically adds shards to your clusters. In either case, adding a shard to a cluster is simple and only requires a click on the ObjectRocket dashboard. It’s what happens after the shard is added where things get a little bit more complicated.

In this blog, I describe what happens after a shard is added, generalizing (as much as possible) so that even if you aren’t an ObjectRocket customer (yet), you’ll learn how adding shards may cause other issues with your cluster and what we do to help avoid problems.

Why do you want to add a shard?

There are two main reasons to add a shard: to add more disk capacity to your cluster or to increase the performance. Sometimes it’s to increase both capacity and performance. This blog will focus on the increased capacity scenario. More specifically, I’ll explain what happens when a shard is added to a cluster and what can go wrong with data balancing between shards.

Collections

Collections on a sharded cluster fall into two categories: sharded and unsharded. A database on a sharded cluster may contain both types of collections.

A sharded collection has a shard key applied to it while an unsharded collection does not. As a result, a sharded collection’s data can exist on multiple shards while an unsharded collection’s data can only exist on one shard. That shard is the primary shard of the database the collection belongs to.

How does a shard increase storage capacity?

When a new shard is added to a sharded cluster, every other shard will donate chunks to it. A chunk consists of a subset of sharded data of a sharded collection. Existing shards move data to the new shard and this reduces their data footprint. The process responsible for distributing the data is called a balancer. (We call this movement data balancing, most of the time.)

For example, let’s assume our dataset consists of one database and one sharded collection. We store our data on a single shard, named S1, with a 50G disk.

When the capacity reaches close to 50G (48G in our example), we add another shard (S2) with the same specs. After some time*, both shards should hold 24G of data and indexes as S1 donates data to S2.
*Amount of time depends on balancer settings.

What if you have unsharded collections? Take the scenario above, but assume that the 48G of data is split between a 10G unsharded collection and a 38G sharded collection.

Again, we add another shard (S2) with the same specs. One might assume S1 should hold 29G of data and indexes and S2 should hold 19G.
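The arithmetic behind that expectation can be sketched in a few lines. This is only a back-of-the-envelope model: expectedLogicalSize is a hypothetical helper, and it assumes the balancer ends up spreading sharded data evenly while all unsharded data stays on the primary shard.

```javascript
// Hypothetical helper: expected logical data size (in GB) per shard after balancing.
// Assumes sharded data is spread evenly and unsharded data stays on the
// primary shard, which is index 0 in the returned array.
function expectedLogicalSize(shardedGb, unshardedGb, shardCount) {
  const sizes = new Array(shardCount).fill(shardedGb / shardCount);
  sizes[0] += unshardedGb; // unsharded collections live only on the primary shard
  return sizes;
}

console.log(expectedLogicalSize(48, 0, 2));  // [ 24, 24 ]
console.log(expectedLogicalSize(38, 10, 2)); // [ 29, 19 ]
```

These numbers describe the documents each shard owns, not the space its files take on disk.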

In reality, due to the way the data transfer between S1 and S2 works, the cluster will end up in the following state:

S1 will hold 48G on disk and S2 19G. In the simplest form, this is how the data transfer between S1 and S2 works:

1. S2 picks a range of documents (a chunk) from S1 based on the shard key.
2. S2 opens a cursor to S1, pulls the documents and writes them to its storage.
3. When step 2 completes, S1 alters the metadata and deletes the transferred documents.

Deleting documents

When a database deletes documents, it doesn’t free up the space on disk since this takes time and has performance consequences. Instead, it marks the record (the area where the document is stored) as “deleted”. Later, it will try to reuse that space. (Deleted records are also known as fragments.) Reuse comes at a cost, so databases use algorithms to decide when to reuse a deleted record and when to append.
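A toy model makes the gap between file size and live data easier to see. DataFile is a made-up class, not how any storage engine is actually written, and it always reuses deleted space first, which is simpler than the algorithms real databases use. Sizes are in GB and follow the S1/S2 example.

```javascript
// Toy model of a data file: deleting marks space as reusable but never
// shrinks the file. Hypothetical and heavily simplified.
class DataFile {
  constructor() {
    this.fileSize = 0;  // space the file occupies on disk
    this.freeSpace = 0; // space inside the file marked as deleted
  }
  insert(gb) {
    const reused = Math.min(gb, this.freeSpace); // simplification: reuse first
    this.freeSpace -= reused;
    this.fileSize += gb - reused; // append only what did not fit
  }
  remove(gb) {
    this.freeSpace += gb; // mark as deleted; fileSize is unchanged
  }
  get liveData() {
    return this.fileSize - this.freeSpace;
  }
}

const s1 = new DataFile();
const s2 = new DataFile();
s1.insert(48); // 10G unsharded + 38G sharded
s2.insert(19); // S2 copies half of the sharded data
s1.remove(19); // S1 deletes the migrated documents
console.log(s1.fileSize, s1.liveData); // 48 29
console.log(s2.fileSize, s2.liveData); // 19 19

s1.insert(5); // new writes on S1 fill the deleted space
console.log(s1.fileSize, s1.liveData); // 48 34
```

In this model S1 only starts growing on disk again once new writes have used up the 19G it freed internally.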

If you wish to “take your space back”, you can use the compact command or trigger an initial sync. Both methods rewrite the data and recreate the indexes, so they remove fragmentation. Both methods lock the replica set node that they run against, so it’s recommended to run them on one secondary at a time. Make sure you have enough oplog to allow the locked secondary to catch up. Also, if you perform secondary reads, ensure that your application can operate with N-1 secondaries. During an initial sync (this doesn’t apply to compact), the secondary must fetch all data from another member, so you may notice some extra overhead. It’s better to run the initial sync during off-peak hours.
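As a rough sketch of the compact route, the helper below runs the compact command for a list of collections. compactCollections is a hypothetical wrapper, db is assumed to be a shell database handle connected directly to one secondary, and the database and collection names in the usage comment are placeholders.

```javascript
// Hypothetical wrapper: runs compact for each named collection on the node
// that `db` is connected to, and returns the server replies by collection.
function compactCollections(db, collectionNames) {
  const results = {};
  for (const name of collectionNames) {
    // compact is issued per collection, against the database that owns it
    results[name] = db.runCommand({ compact: name });
  }
  return results;
}

// Usage in the shell, connected to a single secondary (placeholder names):
//   compactCollections(db.getSiblingDB("billing"), ["invoices"]);
```

Work through the secondaries one at a time, and check that each one has caught up before moving to the next.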

Shard key selection

Shard keys are, well, key. There is no such thing as a perfect shard key. However, the better your shard key is, the fewer problems you will have. Shard keys are their own huge topic, so to learn more about them, check out the following resources:

My presentation from Percona Live EU

Blog: Setting up a sharding infrastructure and scaling with MongoDB

Blog: Autokey: Automated shard key creation

Cluster balancing

Assuming you have a good shard key, the only issue you may face is cluster balancing the unsharded collections. Unsharded collections do not produce an even amount of data between two or more shards. Here’s what you can do about that: try to shard as many collections as you can. If some collections can’t be sharded for whatever reason, try to distribute the unsharded collections between the shards.

If you have a single database, the answer is simple: you can’t do it. Primary shards are defined for databases, so you can move databases and their underlying collections between shards but the same is not true for individual collections. If you have more than one database, you can run the movePrimary command and move a database to a different shard:

db.adminCommand({movePrimary: "<database>", to: "<shard>"})

The movePrimary command requires write downtime for the affected database. Downtime is required because the mongos’s are not aware of the move and might write data to the old shard while the move is happening. The movePrimary command doesn’t have a mechanism to catch and replay changes made after the command was initiated, so you may lose writes.

When the command terminates (and before writes start again), it’s a good idea to force all mongos’s to reload their metadata from the config database using the flushRouterConfig command:

db.adminCommand({flushRouterConfig: 1})

Using this command will ensure that all mongos’s are aware of the metadata change you just made.
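Put together, the two steps might look like the sketch below. movePrimaryAndFlush is a hypothetical helper; admin is assumed to be the admin database handle on one mongos, and routerAdmins an array of admin handles, one per mongos, that you have opened yourself. It does not stop writes for you, so that still has to happen before you call it.

```javascript
// Hypothetical helper: move a database to another primary shard, then
// refresh the routing metadata on every mongos.
function movePrimaryAndFlush(admin, routerAdmins, dbName, toShard) {
  const moved = admin.runCommand({ movePrimary: dbName, to: toShard });
  if (moved.ok !== 1) {
    throw new Error("movePrimary failed: " + JSON.stringify(moved));
  }
  // Flush only after the move has finished and before writes resume.
  return routerAdmins.map((router) => router.runCommand({ flushRouterConfig: 1 }));
}

// Usage in the shell with a single mongos (placeholder names):
//   const admin = db.getSiblingDB("admin");
//   movePrimaryAndFlush(admin, [admin], "reports", "S2");
```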

A poor shard key changes everything. Note that some of the following cases may occur even with a good shard key, but it’s much easier to deal with the fallout if you have a good key.

Jumbo chunks

MongoDB has a process to split chunks when they exceed a configurable size. The goal is to keep all chunks the same size. A jumbo chunk is a chunk that exceeds the configurable size. The balancer is unable to move jumbo chunks. If the balancer attempts to move a jumbo chunk, it will fail, mark the chunk with {jumbo:true} in the config.chunks collection, and never attempt to move that chunk again. Instead, the balancer will pick a different chunk that can be moved.
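Because the flag lives in config.chunks, you can look for it yourself. The sketch below counts flagged chunks per shard; jumboChunksPerShard is a hypothetical helper, and configDb is assumed to be a handle to the config database obtained through a mongos.

```javascript
// Hypothetical helper: count chunks flagged as jumbo, grouped by the shard
// that currently owns them.
function jumboChunksPerShard(configDb) {
  const counts = {};
  const jumbos = configDb.getCollection("chunks").find({ jumbo: true }).toArray();
  for (const chunk of jumbos) {
    counts[chunk.shard] = (counts[chunk.shard] || 0) + 1;
  }
  return counts;
}

// Usage in the shell, connected to a mongos:
//   jumboChunksPerShard(db.getSiblingDB("config"));
```

A shard that shows up with a large count here is a likely candidate for holding more data than its chunk count suggests.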

Since jumbos are bigger, they tend to get stuck on a shard. That shard may hold more data than the others. The balancer doesn’t take into account the size of the chunks at all. It just tries to make sure all shards have the same number of chunks.

For example, let’s say you have 100 chunks on S1 and 50 are jumbos. When you add S2, the balancer will move the 50 non-jumbo chunks from S1 to S2 leaving S1 with more data. Jumbo chunks can become a big issue when MongoDB is unable to split. Jumbo chunks happen, even with a perfect shard key. To deal with this situation, you have to split the chunks manually using the following command:

db.adminCommand({split: "<database>.<collection>", bounds: [{<shard key>: <lower bound>}, {<shard key>: <upper bound>}]})
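To put numbers on the 100-chunk example above, here is a toy balancer that only looks at chunk counts. balanceByCount is a hypothetical function, not the real balancer, and the chunk sizes (64 MB for a normal chunk, 200 MB for a jumbo) are made up for illustration.

```javascript
// Toy balancer: moves non-jumbo chunks from source to target until the
// chunk counts are level. It never looks at chunk sizes.
function balanceByCount(source, target) {
  while (source.length - target.length >= 2) {
    const i = source.findIndex((chunk) => !chunk.jumbo);
    if (i === -1) break; // only jumbo chunks left, nothing can move
    target.push(source.splice(i, 1)[0]);
  }
}

const totalMb = (chunks) => chunks.reduce((sum, chunk) => sum + chunk.sizeMb, 0);

const s1 = [];
const s2 = [];
for (let n = 0; n < 50; n++) {
  s1.push({ jumbo: false, sizeMb: 64 });  // assumed size of a normal chunk
  s1.push({ jumbo: true, sizeMb: 200 });  // assumed size of a jumbo chunk
}

balanceByCount(s1, s2);
console.log(s1.length, s2.length);     // 50 50
console.log(totalMb(s1), totalMb(s2)); // 10000 3200
```

The chunk counts are perfectly balanced, yet with these assumed sizes S1 still holds roughly three times as much data as S2.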

What happens when you can’t split a chunk?

As an example, let’s say that your business stores digital invoices for your customers. You shard the collection for invoices using the customer identifier (since all of your queries are using it). You start to attract small customers that issue few invoices daily. Everything is working great until one day, a huge retail store also wants to use your service. This retailer stores thousands of invoices per day. Suddenly, you realize that your shard key is not going to work that well anymore. The chunk that contains the big retail company will grow and grow and you will be unable to split. The only solution, in this case, is to change the shard key.
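The reason the chunk can’t be split is that a split point has to fall between two different shard key values. The sketch below illustrates that rule with a hypothetical splitPoints helper and made-up customer identifiers; it is a simplification of how split points are really chosen.

```javascript
// Hypothetical helper: candidate split points for a chunk, given the shard
// key value of every document in it. A chunk can only be cut where the
// shard key value changes.
function splitPoints(keyValues) {
  const distinct = [...new Set(keyValues)].sort();
  return distinct.slice(1); // every value after the first could start a new chunk
}

// A chunk shared by several small customers can be split.
console.log(splitPoints(["acme", "bolt", "bolt", "cobb"])); // [ 'bolt', 'cobb' ]

// A chunk holding only the big retailer's invoices cannot.
console.log(splitPoints(["megastore", "megastore", "megastore"])); // []
```

However many invoices the retailer adds, every one of them carries the same customer identifier, so the list of split points stays empty.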

Empty chunks

Since you’ve learned your lesson, you change the shard key to customer_id, invoice_id, a unique combo which will always produce split points. After a few months, some customers churn and you delete their data to save space. You delete based on customer_id, which is very likely to leave some chunks without documents. The empty chunks affect your balancing, as the balancer wants every shard to have the same number of chunks and not the same amount of data. In extreme cases, you may end up with empty shards. In this case, one solution is to merge the empty chunks:

db.adminCommand({mergeChunks: "<database>.<collection>", bounds: [{<shard key>: <lower bound>}, {<shard key>: <upper bound>}]})
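If there are many empty chunks, you will probably want to generate these commands rather than type them. The sketch below builds one mergeChunks document per run of adjacent empty chunks. Everything about its input is an assumption: the chunks are expected to belong to one collection on one shard, ordered by range, and the docs field is a document count you would have to gather yourself, since config.chunks does not store one. The single-field key values are simplified for the example.

```javascript
// Hypothetical helper: build mergeChunks command documents for runs of two
// or more adjacent empty chunks. `chunks` must be ordered by range and all
// live on the same shard; `docs` is a count you collected separately.
function mergeCommandsForEmptyRuns(namespace, chunks) {
  const commands = [];
  let run = [];
  const flush = () => {
    if (run.length >= 2) {
      commands.push({
        mergeChunks: namespace,
        bounds: [run[0].min, run[run.length - 1].max],
      });
    }
    run = [];
  };
  for (const chunk of chunks) {
    const last = run[run.length - 1];
    const adjacent = !last || JSON.stringify(last.max) === JSON.stringify(chunk.min);
    if (chunk.docs === 0 && adjacent) {
      run.push(chunk);
    } else {
      flush();
      if (chunk.docs === 0) run.push(chunk);
    }
  }
  flush();
  return commands;
}

const commands = mergeCommandsForEmptyRuns("billing.invoices", [
  { min: { customer_id: 0 }, max: { customer_id: 10 }, docs: 0 },
  { min: { customer_id: 10 }, max: { customer_id: 20 }, docs: 0 },
  { min: { customer_id: 20 }, max: { customer_id: 30 }, docs: 5 },
  { min: { customer_id: 30 }, max: { customer_id: 40 }, docs: 0 },
]);
console.log(JSON.stringify(commands));
// [{"mergeChunks":"billing.invoices","bounds":[{"customer_id":0},{"customer_id":10+10}]}] in spirit:
// one command covering customer_id 0 to 20
```

Each returned document could then be passed to db.adminCommand. A lone empty chunk is left out, because merging needs at least two chunks.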

Another solution is to change the shard key.

Orphan documents

Orphaned documents are documents on a shard that also exist in chunks on other shards due to failed migrations or incomplete migration cleanup. S2 copies data from S1 and then S1 deletes the copied data. The copy and delete commands live in RAM. If S1 or S2 is restarted during the copy or delete process, these commands are lost and the shards can’t clean up the documents that were already transferred or were pending deletion. These documents are not visible (via primary reads) but they contribute to storage and can cause an imbalance.

Orphan documents can happen with either a good or a poor shard key. They are actually more likely with a good one, because with a poor shard key the balancer might not be able to move any chunks at all. MongoDB offers the cleanupOrphaned command, which can clean up orphaned documents. It’s also highly advisable to monitor for service restarts or step-downs and to check if there was an active migration near that time.
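As a sketch of how cleanupOrphaned is typically driven, the loop below keeps calling it until the whole key range has been covered. cleanupAllOrphans is a hypothetical helper, and shardAdmin is assumed to be the admin database handle on a shard’s primary (not a mongos). The loop relies on the stoppedAtKey field returned by the server versions current when this was written; newer versions handle orphan cleanup differently, so check the documentation for your version before relying on it.

```javascript
// Hypothetical helper: run cleanupOrphaned range by range until the server
// stops returning a stoppedAtKey. Returns the number of passes it made.
function cleanupAllOrphans(shardAdmin, namespace) {
  let nextKey = {};
  let passes = 0;
  while (nextKey != null) {
    const result = shardAdmin.runCommand({
      cleanupOrphaned: namespace,
      startingFromKey: nextKey,
    });
    if (result.ok !== 1) {
      throw new Error("cleanupOrphaned failed: " + JSON.stringify(result));
    }
    passes += 1;
    nextKey = result.stoppedAtKey; // missing once the last range is done
  }
  return passes;
}

// Usage in the shell, connected to the primary of one shard (placeholder names):
//   cleanupAllOrphans(db.getSiblingDB("admin"), "billing.invoices");
```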

We’re here to help scale your MongoDB instances

Adding a shard is easy. It’s ensuring that the cluster is balanced and that you’re using your new space wisely that are the tricky parts. Our experienced MongoDB DBAs can help you add shards and seamlessly scale your clusters to petabytes and beyond. We help manage the largest MongoDB cluster on the planet and love helping our customers get the most out of their database.

Give us a try free for 30 days. Our scaling expertise is included.