Finding unused indexes in MongoDB

Antonios Giannopoulos

8 years ago

Beginning in version 3.2, MongoDB tracks usage statistics for every index. To access these statistics, MongoDB offers the $indexStats aggregation pipeline stage. Here are six considerations when finding unused indexes in MongoDB.

For example, the following command will provide the index statistics for the collection “test.foo”:

db.foo.aggregate( [ { $indexStats: { } } ] )

https://bit.ly/2seXnzo

We won’t describe the $indexStats output as there is plenty of documentation and many good articles out there that cover this topic. Instead, we will supply six considerations on different areas when using the $indexStats operator.

Consideration 1: Statistics are reset on every service restart

Always pay extra attention to the “accesses.since” field when using the $indexStats operator. Some query patterns might be infrequent, such as an end of day batch process or a weekly report, so make sure the duration of the statistics you are going to evaluate covers your needs. The following script allows you to set a threshold (in hours) and will print a warning if an unused index doesn’t comply.

threshold_hours=24; 

db.foo.aggregate( [ { $indexStats: { } } ] ).forEach(function(f){if (f.accesses.ops==0) 
{if (ISODate()-f.accesses.since < threshold_hours*3600*1000) {print('Index: ' +f.name+ ' accessed: 
' +f.accesses.ops+ ' times, Status:WARNING, The duration of statistics DOES NOT meet your compliance')} 
else {print('Index: ' +f.name+ ' accessed: ' +f.accesses.ops+ ' times,  Status:OK The duration of 
statistics meet your compliance');};}})

What is the right threshold for you? There’s no simple answer. Not all statement patterns are executed with the same frequency, so the right threshold varies from application to application and may also be different per collection within the same database.

Consideration 2: Secondary reads

By default, $indexStats reads from the Primary. If your application only reads from the Secondaries, running the $indexStats against the Primary will lead to the wrong conclusions. In order to read the results from the Secondaries, you may use the script from the first consideration above, but execute db.getMongo().setReadPref(‘secondary’) before running it, which forces the command to read from the Secondaries.

The question now is “what if a collection receives both Primary and Secondary reads?” Using the following script, you can populate the idx array with the indexes that are unused on both Primary and Secondary:

idx=[];

db.foo.getIndexes().forEach(function(f){idx.push(f.name)})
db.getMongo().setReadPref('primary');
db.foo.aggregate( [ { $indexStats: { } } ] ).forEach(function(f){if (f.accesses.ops>0) 
{ var index = idx.indexOf(f.name); if (index > -1) {idx.splice(index, 1);};}})
db.getMongo().setReadPref('secondary');
db.foo.aggregate( [ { $indexStats: { } } ] ).forEach(function(f){if (f.accesses.ops>0) 
{ var index = idx.indexOf(f.name); if (index > -1) {idx.splice(index, 1);};}})

The contents of idx array will contain the unused indexes on both Primary and Secondaries. What about the compliance window (threshold_hours)? Slightly modifying the script from consideration 1 (previous subsection) will allow us to apply the time compliance. Just replace the aggregation part with the one provided below and execute the script against Primary and Secondaries using the setReadPref:

db.foo.aggregate( [ { $indexStats: { } } ,{$match:{"name" : {$in:idx}}}] ).

If Primary and Secondaries return different compliance results, the best approach is to wait for both tiers to comply.

Using Secondary read preference may lead to “incorrect results” in the case that a Secondary is recently restarted and the script chooses that Secondary to pull statistics. For this case, a good approach is to check db.serverStatus().uptime and pick the Secondary with the higher uptime. The uptime method requires a different approach on the script which will be implemented in a future revision of this blog post.

Consideration 3: Replica set tags

Replica set tags are applicable in lot of use cases but are mainly used for read locality in a geo-replication scenario and for targeting specific workloads to dedicated nodes (e.g., heavy analytics operations). When it comes to workload targeting, the tagged nodes must be examined separately as they may use an index or indexes that other nodes don’t use. This may also be the case in geo-replication, although it’s more rare. The script from consideration 1 may be used to check a tagged member, with the addition of readPreference at the very beginning of the script. Here is an example for setting the read preference to a Secondary that is marked for analytics:

 db.getMongo().setReadPref('secondary', [ { "workload": "analytics" } ] ) .

The future approach, described in the last paragraph of in the second consideration above, can be expanded to check tagged Secondaries as well.

Consideration 4: TTL indexes

TTL indexes serve operations but their main use is for data pruning. It’s very likely for the $indexStats to report these indexes as unused because the TTL monitor doesn’t count as an index operation. You definitely don’t want to drop a TTL index based on the $indexStats report findings so the scripts must exclude this type of index.

A small modification on the populate idx array script from the second section will do the job. Replace the second line with the following one and the idx array won’t contain any TTL index(es).

db.foo.getIndexes().forEach(function(f){if (f.expireAfterSeconds==undefined) 
{idx.push(f.name)}})

Taking the discussion about excluded indexes one step further, we can claim that _id also falls into this category. Obviously, you can’t drop the _id index even if its unused. A collection that never touches the _id index may need redesign.

Consideration 5: Sharded clusters

When it comes to sharded clusters, there are two items to consider before evaluating the output of $indexStats. First, when using the $indexStats against a sharded collection, it is possible the shard key index is classified as unused. For example, a write-heavy collection which is sharded on {_id:”hashed”} (for even distribution of writes) but without any reads/updates/deletes operations that can utilize the _id index. In this case, the {_id:”hashed”} reports as unused. It’s not a good idea to drop the index as it’s going to break your sharded cluster. If you want to exclude the shard key for ‘stats.foo’ collection, the following script can be attached to the method described in consideration two (secondary reads) and shard key will be excluded from the idx array.

shardkey=db.getSiblingDB('config').collections.findOne({_id:'stats.foo'},{_id:0,key:1});
db.foo.getIndexes().forEach(function(f){if (JSON.stringify(f.key)!=JSON.stringify(shardkey.key)) 
{printjson(shardkey.key);printjson(f.key);idx.push(f.name)}})

Another consideration is related to the output of $indexStats, as it now returns the statistics from each and every shard the collection has a presence. While it is uncommon, there are cases that an index(es) reports as unused in a few shards while other shards using it. A poor shard key can be the cause but the most common scenario is named “covering indexes”.

Here is an example to better understand covering indexes: Indexes {a:1} and {a:1,b:1} can both serve an equality match on field ‘a’. If shardA’s optimizer picks {a:1} and shardB’s picks {a:1,b:1}, both indexes will report unused for at least one shard.

The challenge is to find which indexes are not globally in use. Replacing the aggregation part (followed by a forEach loop) of consideration 2 script (secondary reads) with the following one will do the job:

db.foo.aggregate( [ { $indexStats: { } } , {$group: {_id:"$name",number :
 {$sum:"$accesses.ops"}}}] ).forEach(function(f){if (f.number>0) { var index = idx.indexOf(f._id);
 if (index > -1) {idx.splice(index, 1);};}})

Another challenge is to discover the indexes that are partially in use. This piece of information could be helpful when it comes to redundant indexes or index identification malfunctions. Using the idx array, the following aggregation/script is populated with the globally unused indexes and reports the partially-in-use indexes:

db.foo.aggregate( [ { $indexStats: { } 
},{$match:{name:{$nin:idx},"accesses.ops":0}}]).forEach(function(f){print("Index "+f.name+" 
reports as partially unused on shard/host " +f.host)})

Consideration 6: Least-used indexes

Discovering and dropping unused indexes is crucial but it’s also important to evaluate least used indexes. If an index is only accessed one or two times in a week or months’ time, it may mean that it’s not necessary or beneficial for your workload. It’s always a good practice to periodically check for least-used indexes using the following aggregation:

db.foo.aggregate( [ { $indexStats: { } },{$match:{"accesses.ops":{$gt:0}}},{$group: 
{_id:"$name",number : {$sum:"$accesses.ops"}}},{$sort:{number:1}}] )

Then, take the appropriate actions which could be either to drop the index(es), change the index(es) definition, or even change your schema/application logic to make the index(es) redundant.

We’re here for you

We’re here to help you clean up your indexes for your MongoDB instances. All you have to do is create a ticket to support@objectrocket.com and let us help you improve your indexes!