Thoughts on Sitecore Knowledge Base 939840

This post outlines some discoveries related to Sitecore knowledge base 939840, an article that explains how to avoid timeouts related to saving contacts with a large number of interactions to the Sitecore Analytics database.

During our benchmarks for the presentation “How Sitecore depends on MongoDB for scalability and performance, and what it can teach you”, we noticed a few exceptions from a partially-covered query on the Interactions collection.

The erroneous query was of the type:

{ $query: { ContactId: <value>, _t: <value> }, $orderby: { StartDateTime: 1, _id: 1 } }

The exception observed in the MongoDB logs was as follows (where N > 33554432):

assertion 17144 Plan executor error during find: Overflow sort stage buffered data usage of N bytes exceeds internal limit of 33554432 bytes ns:foo.Interactions query:{ $query: { ContactId: <value>, _t: <value> }, $orderby: { StartDateTime: 1, _id: 1 } }

By default, Sitecore deploys the following indexes on the Interactions collection:

"_id" : 1
"ContactId" : 1, "ContactVisitIndex" : 1
"SaveDateTime" : 1, "ContactId" : 1, "_id" : 1
"StartDateTime" : 1
"ContactId" : 1, "_t" : 1

From the above indexes, {“ContactId” : 1, “_t” : 1} fully covers the $query phase, but not the $orderby phase (Note: Index {“ContactId” : 1, “ContactVisitIndex” : 1} also partially covers the $query phase). If MongoDB cannot use an index to get documents in the requested sort order, the combined size of all documents in the sort operation (plus a small overhead) must be less than 32 megabytes. If the sort operation exceeds the 32MiB threshold, an exception is generated.

The workaround described in the Sitecore knowledge base article entitled “Analytics database time-out can occur when a contact session is submitted to the database” correctly states that a new index on {“ContactId”: 1, “_t”: 1, “StartDateTime”: 1, “_id”: 1} should be created.

An index on {“ContactId”: 1, “_t”: 1, “StartDateTime”: 1, “_id”: 1} will cover both $query and $sort phases and not only will eliminate the exception, but will also make the erroneous query faster.

The new index on {“ContactId”: 1, “_t”: 1, “StartDateTime”: 1, “_id”: 1} makes the {“ContactId”: 1, “_t”: 1} index redundant. It can be removed to save space as long as you remove it from your Sitecore implementation. If you don’t, Sitecore may attempt to re-deploy the index when the application is next restarted.

The {“ContactId”: 1, “_t”: 1, “StartDateTime”: 1, “_id”: 1} index is larger than the {“ContactId”: 1, “_t”: 1} index. On our implementation (using WiredTiger with snappy compression), the document count for the Interactions collection was 6,400,000, and the size of each index was:

"ContactId_1__t_1_StartDateTime_1__id_1" : 362 MiB,     
"ContactId_1__t_1" : 190 MiB

The {“ContactId”: 1, “_t”: 1, “StartDateTime”: 1, “_id”: 1} index added 362 MiB on storage but if we decide to drop {“ContactId”: 1, “_t”: 1}, the overhead is 172 MiB.

We noticed that the erroneous query was always using the same value predicate for _t, specifically {“_t” : “VisitData”}. Since we were using Sitecore 8.2, we have the option to deploy MongoDB 3.2 for the xDB feature. MongoDB 3.2 supports partial indexes, so we changed the index definition from {“ContactId”: 1, “_t”: 1, “StartDateTime”: 1, “_id”: 1} to:

{"ContactId": 1, "StartDateTime": 1, "_id": 1},"partialFilterExpression" : {"_t" : "VisitData"}

The size of each index was:

"ContactId_1__t_1_StartDateTime_1__id_1" : 362 MiB,
"ContactId_1_StartDateTime_1__id_1" : 298 MiB,
"ContactId_1__t_1" : 190 MiB

We managed to reduce the index size by 64MiB and the overhead to 298 MiB of storage, but if we decide to drop {“ContactId”: 1, “_t”: 1}, the overhead is 108 MiB.

The next challenge was to identify potentially erroneous queries in advance.

As the Sitecore documentation states, “Sitecore calculates diskspace sizing projections using 5KB per interaction and 2.5KB per identified contact and these two items make up 80% of the diskspace”. According to the documentation, we will hit the exception when more than 6553 documents exists for a single {ContactId, _t} combination. The following aggregation will return the combination(s) that may potentially result in the buffer overflow exception:

db.Interactions.aggregate([{$group: { _id:{c_id:"$ContactId", c_t:"$_t"},number : {$sum:1}}},{$match:{number:{$gte:6553}}}],{allowDiskUse:true})

In our environment, there was only one combination with count higher than 6553, so we decided to deploy a partial index just for the erroneous combination:

 {"StartDateTime" : 1, "_id" : 1},{"partialFilterExpression":{"_t" : "VisitData","ContactId":BinData(3,"")}})

The new partial index was just 272 KiB!

The above technique, deploying one partial index per combination, may be problematic when you have more than one combination with count higher than 6553. By design, you can’t deploy the same index twice by only changing the partialFilterExpression definition. However, you can overcome the limitation by adding a dummy field at the end of each index definition:

{"StartDateTime" : 1, "_id" : 1, dummy01:1},{"partialFilterExpression":{"_t" : "VisitData","ContactId":BinData(3,"")}})

Using the dummy field is a fine workaround if you don’t have more than 60 combinations. MongoDB, by design, supports up to 65 indexes, so you can deploy a maximum of 60 partial dummy indexes (as 5 indexes already exists). If you have more than 60 combinations, the workaround won’t be efficient. In general, you should avoid adding dummy partial indexes unless you know in advance that only a limited amount of {ContactId, _t} combinations may exceed the 6553 documents limit.

The next challenge was to define with accuracy which number of documents for a single {ContactId, _t} combination that may produce an exception. Our first action was to identify the break point on our environment; to do this, we executed the erroneous query with a limit() higher than 6553 using the following script:

for (i = 6553; i <= CountForTheCombination; i=i+1){ x=db.Interactions.find({"_t" : "VisitData", "ContactId" : BinData(3,"7HoQth67QkG3Il/A5XSDPw==")}).sort({"StartDateTime":1, "_id":1}).limit(i).hint("ContactId_1__t_1").explain(true).executionStats.executionStages.shards[0].errorMessage; if (x=="Exec error: OperationFailed: Sort operation used more than the maximum 33554432 bytes of RAM. Add an index, or specify a smaller limit., state: FAILURE") {print("You will hit the bug with: "+i+" documents"); break;}}

The script returned the first exception on 7080 documents, which means that the size of 7080 documents is slightly higher than 32 MiB.

The next step was to develop an algorithm based on the avgObjSize. Given that MongoDB is using the raw documents (uncompressed data) during the $orderby phase, WiredTiger is a greater challenge than MMAP, as the MMAP storage engine does not support compression. In MMAP, the avgObjSize can easily be retrieved from the collection stats using:

db.Interactions.stats().avgObjSize.

When using WiredTiger, the same command returns the compressed avgObjSize. From past research (based on various Sitecore datasets between 10 and 20GiB, note: compression ratio increases as dataset grows), we already know that the compression ratio for Interactions is ~2.4 with snappy compression, i.e., 2.4 bytes uncompressed equals 1 byte compressed.

In our environment, avgObjSize was 1966 bytes, which gives us an uncompressed size of 4719 bytes. Using uncompressed avgObjSize, we estimate that the exception might occur with 7111 documents, which is close to real break point of 7080 documents.

It seems that using the avgObjSize provided us with a good estimation, but the estimation was higher than the actual break point. If you want to be proactive and use the algorithm to predict if the exception will occur in the future, it’s good practice to either add a corrective multiplier on the equation. For example, multiply it with 0.98 or increase the compression ratio. Both will produce a lower document limit.

Building the index is impactful, and you may want to perform the operation during off-peak hours. Even if you are using the secondary index build method, which is less impactful, the index build will still take some time to complete. In the meantime, your MongoDB instance is still vulnerable to the exception. In order to mitigate the exception until the index is in place, you may temporarily increase the internalQueryExecMaxBlockingSortBytes variable. The default value is 32MiB. In order to calculate the new value, you may use the maximum number (N value) from the error produced on the logs.

assertion 17144 Plan executor error during find: Overflow sort stage buffered data usage of N bytes exceeds internal limit of 33554432 bytes ns:foo.Interactions query:{ $query: { ContactId: <value>, _t: <value> }, $orderby: { StartDateTime: 1, _id: 1 } }

Changing internalQueryExecMaxBlockingSortBytes is a band-aid, and you should revert to 32MiB once the index is in place. Increasing internalQueryExecMaxBlockingSortBytes on an instance will increase the memory usage and may hurt the overall performance.

To change internalQueryExecMaxBlockingSortBytes on-the-fly, use the following command:

db.adminCommand({setParameter: 1, internalQueryExecMaxBlockingSortBytes: N})

To make the change permanent in order to protect from restarts until the index is in place, add the following configuration option to each mongod:

setParameter:
    internalQueryExecMaxBlockingSortBytes: N

Sharding on _id is a factor that may reduce the occurrences of the exception and, in some cases, can totally eliminate it (e.g. running Sitecore without the $orderby optimized index). When sharding on _id, each shard will hold a portion of the overloaded {ContactId, _t} combination. Even if you are ordering on an overloaded {ContactId, _t}, the exception may never occur, as each shard will only have to sort a subset of the overloaded {ContactId, _t} combination locally and then send it to the mongos. Sharding on ContactId won’t have any impact as far as the exception occurrence. All documents for the overloaded {ContactId, _t} combination will live on the same shard so you will get the exception. Learn more about scaling and Sitecore by viewing this slide deck.

As always, the ObjectRocket team is here to help you design, configure and deploy your xDB! Contact us at dataexperts@objectrocket.com. We’d love to hear from you!

Thoughts on Sitecore Knowledge Base 939840

Antonios Giannopoulos

Previous PostCrimson Study: ObjectRocket DBaaS lowered data management costs by orders of magnitude

Next PostRocketScale™: Auto-scale your ObjectRocket for Elasticsearch Instances with Ease

Services

Platform

Resources

COMPANY

Support