Strategies for reducing wasted space in MongoDB


Appboy is the world’s leading marketing automation platform for mobile apps. We collect billions of data points each month by tracking what users are doing in our customers’ mobile apps and allowing them to target users for emails, push notifications and in-app messages based on their behavior or demographics. MongoDB powers most of our database stack, and we host dozens of shards across multiple clusters at ObjectRocket.

One common performance optimization strategy with MongoDB is to use short field names in documents. That is, instead of creating a document that looks like this…

{first_name: "Jon", last_name: "Hyman"}

…use shorter field names so that the document might look like…

{fn: "Jon", ln: "Hyman"}

Since MongoDB doesn’t have a concept of columns or predefined schemas, field names are stored inside every document. If you have one million documents that each have a “first_name” field on them, you’re storing that string a million times. Shorter field names therefore mean less space per document, which ultimately affects how many documents fit in memory and, at large scale, may slightly impact performance, as MongoDB has to map documents into memory as it reads them.

In addition to collecting event data, Appboy also lets our customers store what we call “custom attributes” on each of their users. As an example, a sports app might want to store a user’s “Favorite Player,” while a magazine or newspaper app might store whether or not a customer is an “Annual Subscriber.” At Appboy, we have a document for each end user of an app that we track, and on it we store those custom attributes alongside fields such as their first or last name. To save space and improve performance, we shorten the field names of everything we store on the document. For the fields we know in advance (such as first name, email, gender, etc.) we can do our own aliasing (e.g., “fn” means “first name”), but we can’t predict the names of custom attributes that our customers will record. If a customer decided to make a custom attribute named “supercalifragilisticexpialidocious,” we don’t want to store that on all their documents.
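For those known fields, the aliasing can be as simple as a static lookup table. Here is a minimal sketch; “fn” and “ln” match the examples above, while the other short names are our own illustrative guesses:

// Static aliases for fields known in advance. Only fn and ln come from
// the examples above; em and g are illustrative assumptions.
var FIELD_ALIASES = {
  first_name: "fn",
  last_name: "ln",
  email: "em",
  gender: "g"
};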

To solve this, we tokenize the custom attribute field names using what we call a “name store.” Effectively, it’s a document in MongoDB that maps values such as “Favorite Player” to a unique, predictable, very short string. We can generate this map using only MongoDB’s atomic operators.

The name store document schema is extremely basic: there is one document for each customer, and each document has a single array field named “list.” The idea is that the array will contain all the values for the custom attributes, and the index of a given string will be its token. So if we want to translate “Favorite Player” into a short, predictable field name, we simply check “list” to see where it is in the array. If it is not there, we can issue an atomic push to add the element to the end of the array:

db.custom_attribute_name_stores.update(
  {_id: X, list: {$ne: "Favorite Player"}},
  {$push: {list: "Favorite Player"}}
)

We then reload the document and determine the index. Ideally, we would have used $addToSet, but $addToSet does not guarantee ordering, whereas $push is documented to append to the end by default.
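Putting that together, a name store document might look like {_id: X, list: ["Season Ticket Holder", "Favorite Player"]}, where the token for “Favorite Player” is 1. Below is a minimal sketch of the whole lookup-or-append flow in the mongo shell; the getToken helper and the $setOnInsert guard that lazily creates a missing document are our assumptions, not from the original post:

// A sketch of translating an attribute name into its token (its index
// in "list"). Assumes one name store document per customer.
function getToken(customerId, name) {
  // Assumption: name store documents are created lazily on first use.
  db.custom_attribute_name_stores.update(
    { _id: customerId },
    { $setOnInsert: { list: [] } },
    { upsert: true }
  );
  // Guarded push: appends only if "list" does not already contain name.
  db.custom_attribute_name_stores.update(
    { _id: customerId, list: { $ne: name } },
    { $push: { list: name } }
  );
  // Reload and return the index of the name; that index is the token.
  var doc = db.custom_attribute_name_stores.findOne({ _id: customerId });
  return doc.list.indexOf(name);
}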

So at this point, we can translate something like “Favorite Player” into an integer value. Say that value is 1. Then our user document would look like:

{
  fn: "Jon", 
  ln: "Hyman",
  custom: {
    1: "LeBron James"
  }
}

Field names are short and tidy! One great side effect of this is that we don’t have to worry about our customers using characters that MongoDB can’t support without escaping, such as dollar signs or periods.

Now, you might be thinking that MongoDB cautions against constantly growing documents, and that our name store document can grow without bound. In practice, we have extended our implementation slightly so that we can store more than one name store document per customer. This lets us put a reasonable cap on how many array elements we allow before generating a new document, and the best part is that we can still do it all atomically using only MongoDB. To achieve this, we add another field to each document called “least_value.” The “least_value” field represents how many elements were added to previous documents before this one was created. So if we see a document with a “least_value” of 100 and a “list” of [“Season Ticket Holder”, “Favorite Player”], then the token value for “Favorite Player” is 101 (we’re using zero-based indexing). In this example, we store only 100 values in the “list” array before creating a new document.

Now, when inserting, we modify the push slightly to operate on the document with the highest “least_value,” and also ensure that “list.99” does not exist (meaning that there is nothing at index 99 of the “list” array). If an element already exists at that index, the push will do nothing. In that case, we know we need to create a new name store document with a “least_value” equal to the total number of elements that exist across all the documents. Using an atomic findAndModify, we can create the new document if it does not exist, fetch it back and then retry the $push.
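Here is a hedged sketch of that flow in the mongo shell, extending (and replacing) the single-document getToken above. The “least_value” field and the “list.99” guard come straight from the description; the 100-element cap as a constant, the customer_id field and the retry loop are illustrative assumptions:

// Sketch of the capped, multi-document name store. Assumes each
// customer starts with one document having least_value 0.
var CAP = 100;

function getToken(customerId, name) {
  while (true) {
    // 1. If the name is already tokenized somewhere, its token is that
    //    document's least_value plus its index within "list".
    var doc = db.custom_attribute_name_stores.findOne(
      { customer_id: customerId, list: name });
    if (doc) return doc.least_value + doc.list.indexOf(name);

    // 2. Otherwise push onto the newest document, but only if it still
    //    has room: nothing may exist at index 99 yet.
    var newest = db.custom_attribute_name_stores.find(
      { customer_id: customerId }).sort({ least_value: -1 }).limit(1).next();
    var res = db.custom_attribute_name_stores.update(
      { _id: newest._id,
        list: { $ne: name },
        "list.99": { $exists: false } },
      { $push: { list: name } });
    if (res.nModified === 1) continue; // pushed; re-read to find the index

    // 3. The push was a no-op. If the newest document is full, atomically
    //    create its successor with least_value equal to the total number
    //    of elements across all documents, then retry the push.
    if (newest.list.length >= CAP) {
      db.custom_attribute_name_stores.findAndModify({
        query: { customer_id: customerId,
                 least_value: newest.least_value + CAP },
        update: { $setOnInsert: { list: [] } },
        upsert: true
      });
    }
  }
}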

If our customer has more than just a few custom attributes, reading back all the name store documents to translate from values to tokens can be expensive in terms of bandwidth and processing. However, since the token value of a given field is always the same once it has been computed, we cache the tokens to speed up the translation.
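A minimal sketch of that memoization, with a plain in-process object standing in for whatever shared cache layer is actually used (an assumption on our part):

// Tokens never change once computed, so they can be cached forever.
var tokenCache = {};

function cachedToken(customerId, name) {
  var key = customerId + ":" + name;
  if (!(key in tokenCache)) {
    tokenCache[key] = getToken(customerId, name); // sketch from above
  }
  return tokenCache[key];
}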

We’ve applied the “name store token” paradigm in various parts of our application to cut down on field name sizes while continuing to use a flexible schema. It can also be helpful for values. Let’s say that a radio station app stores a custom attribute that is an array of the top 50 artists a user listens to. Instead of an array of 50 strings, we can tokenize the artist names and store an array of 50 integers on the user instead. Querying for users who like a certain artist now involves two token lookups: one for the field name and one for the value. But since we cache the translation from value to token, we can use a multi-get in our cache layer to keep it to a single round-trip to the cache when translating any number of values.
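As an illustrative example, a query for users whose tokenized artist list contains a given artist might look like the following. The attribute name, artist, and users collection are hypothetical, and we assume values run through a name store just as field names do:

// Translate both the field name and the value to tokens, then query.
// cachedToken is the sketch above; customerId is the app's customer.
var fieldToken = cachedToken(customerId, "Top Artists");
var valueToken = cachedToken(customerId, "The Beatles");

var query = {};
query["custom." + fieldToken] = valueToken; // e.g. { "custom.1": 7 }
db.users.find(query);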

This optimization certainly adds some indirection and complexity, but when you store hundreds of millions of users like we do at Appboy, it’s a worthwhile optimization. We’ve saved hundreds of gigabytes of expensive SSD space through this trick.

Want to learn more? I’ll be discussing devops at Appboy during the Rackspace Solve NYC Conference on Sept 18th at the Cipriani.