The standard redundant Redis solution is to run master/slave replication with Sentinel managing the failover. This is expected to be followed up with either a) Client support and use of Sentinel to discover the current master or b) A TCP proxy in front of the Redis pod which is managed by Sentinel to point to the master. The former is the ways Redis Sentinel is designed and the latter is a growing trend – and the way ObjectRocket Redis is configured.
The effects of failover on commands in Redis
For the vast majority of Redis operations this works as expected. On a failover the client code is expected to detect the connection has failed and either a) look up the current master and reconnect or b) attempt a reconnect based on the scenarios mentioned above. The proxy based scenario is also identical to a Redis server restart so this is also applicable to a case where you have a single instance Redis which can be restarted.
However a small handful of Redis commands are long-term blocking or non-atomic. The PUBSUB subscribe and psubscribe commands are non-atomic in that the command will register it’s request to have any messages sent to the given channel(s) sent to it as well, but the actual data can arrive much later – ie. after it is published. The BL* commands are long-term and two phase commands in that like the subscribe commands they will block and not return until an item is available in the list to be removed.
A key commonality between these commands is that they are issued and a form of registration is done on the server, but the data can come later. In between the command being issued and the server having data to reply with failovers and restarts can occur. In the case of a restart or a failover the “registration” is no longer on the server. This brings us to the question of what happens regarding these commands in these scenarios.
Before delving into those details there is another aspect to bring up as it will come into play during the testing and has dramatic effects on the results and how to code for the failure scenarios. There are in fact two types of failovers. There is the failover most of us consider – the loss of the master. In this type of failover the master is non-responsive. I refer to this as a “triggered failover”. The second class of failovers I refer to as the “initiated failover”.
The initiated failover takes place when an administrator sends the sentinel failover command to a Sentinel which then reconfigures the pod, promoting a master then demoting the original master. On the surface these two scenarios appear to be the same. In the case of the vast majority of Redis commands they can be treated identically. However, in the case of our long-block commands they must be understood in their own context.
The key functional differentiator between these two failover classes is the sequence of events. When a triggered failover occurs it is because the master is not responding. With that being the case no data is being added or modified, and no messages published on the master. In an initiated failover there is a very brief window where there are two masters.
The techniques and scenarios discussed here will be described in the general and demonstrated using the Python Redis library “redis-py”, but apply to any client library. With the background in hand, let us now look at how this affects the use of the long-block commands. First let us look at PUBSUB.
Redis PUBSUB consists of three commands: PUBLISH, SUBSCRIBE, and PSUBSCRIBE. The first command is used to send a message whereas the latter two are used to register for and receive messages published via the PUBLISH command. First, let us examine what happens in a PUBLISH command.
When you execute PUBLISH it will publish the message to the given channel immediately and return the number of clients which received it. In what way does a failure scenario bring this to relevance? Consider the case of failover. The client will have to reconnect as the TCP connection goes away. If talking to a proxy it will do a direct reconnect and get a different server. If doing Sentinel discovery it will take a bit longer and still connect to a different server. Where the issue lies is in timing. If the publisher connects to the new server before any subscribers do, the message is gone – lost. But can that happen? To understand that we look to the SUBSCRIBE commands.
When your code issues a subscribe command the server registers in it’s in-memory data structures that that specific connection should get messages on the channel(s) subscribed to. This information is not propagated to slaves. Thus when a failure happens and the client reconnects it must then re-issue the subscribe command. Otherwise the new server does not know that particular connection needs to receive certain messages. Now we have the clear potential for a subscriber to reconnect to the new master after the publisher. The conditions for loss of messages now exist. So how to minimize or prevent this? There are a few options.
First the publisher can leverage the fact that PUBLISH returns the number of clients which received it. If zero clients received it the publisher could retry. For systems where this is a static number of subscribers or where “at least one subscriber” is sufficient this would prevent the message loss. For systems which do not meet this criteria, there is another, if less robust, option.
The second option is to control the reconnect windows. The rule of thumb here is to have the publisher delay be at least three times as long as the subscriber delay. This will provide at least three opportunities to have the subscribers reconnect prior to the publishing of messages. So long as the subscribers are online first, the messages will not go into the ether. However, while more robust than not controlling it there is still the possibility of message loss.
The third option I’ll present to mitigate this race condition is to build in a lock mechanism. This is certainly the most complex route as it requires building or employing a mechanism which prevents the publishers from publishing until the clients connect. The basic process is to have some form of shared location (such as via Zookeeper, a database, Consul, etc.) where the subscribers register that they are ready to receive and the subscribers check for valid and ready subscribers before publishing messages. To further ad complexity the subscribers will have to “unset” themselves in this mechanism when they lose the TCP connection.
It needs to be called out that this is all for a triggered failover. It will work because the master is non-responsive thus no messages go to it. In the initiated failover scenario the master will continue accepting messages until it is no longer the master or is disconnected. Of the above methods none will properly and fully catch this scenario.
In the case of initiated failovers (which can occur during system maintenance for example) your best buttressing effect is to employ both the first and second options, and accept that there can be message loss. How much message loss is entirely dependent on you publishing schedule. If you publish a message on average every few minutes there is a minuscule chance for you actually encounter the loss window. The larger the publishing interval the smaller the risk. Conversely if you are publishing hundreds or thousands of messages per second you will certainly lose some. For the managed proxy scenario it will be dependent on how quickly the failover completes in your proxy. For ObjectRocket’s Redis platform this window is up to 1.5 seconds.
Thus one approach on the interval, which fails at the limits, is to have the subscribers try to reconnect at an interval of 1/3d your publishing interval. Thus if you publish a message once per minute, then code/configure your subscribers to reconnect at least as often as every 20 seconds and your publisher every minute or two. As the publishing interval approaches the 3 second mark this becomes harder to achieve as a failover reconnect process (lookups, TCP handshake, AUTH, SUBSCRIBE, etc.) can easily total a few seconds.
BLPOP and friends
This brings us to the question of how the blocking list commands operate under these conditions. The news here is better than for publishing. In this case we are modifying data structures which are replicated. Furthermore, because these are modifications to data structures we don’t have the same level for risk of lost messages due to order of reconnect. If the producer reconnect prior to a subscriber there is no change to expected behavior. The producer will PUSH the item onto the list (or POP and PUSH) and once the consumer connects the data will be there. With one exception.
In the case of an initiated failover we must consider the brief multiple-master problem. For a few moments, on the order of milliseconds, the original master during an initiated failover will still accept data modifications and as a slave has already been promoted to master these modifications will not replicate. Note this window is very small. In testing it is the order of single digit milliseconds, but it is there. As with publishing this is essentially a corner condition where your odds of, for example, issuing a POP command to the stale master is rare though increases to conceivable conditions when your rate of producers and consumers making list modification commands reaches thousands per second.
In the case of a worker doing BRPOPLPUSH, for example, the result will be such that the item being moved will be “back” in it’s original position after the failover. In the case of a BLPOP the result will be essentially the same. After the failover completes the item will appear to be re-queued. If your items are idempotent jobs, this will not be an issue. How much defensive coding you should do to account for this in the case of non-idempotent jobs is a result of determining the effect of a job being run or item processed twice factoring in how frequently you are making modifications and this the likelihood of encountering this situation. It should also be noted that since this only happens in the event of an initiated failover this should be something under operational control and I would advise that whenever possible maintenance should be done at a time when your systems are at their least usage to minimize or even eliminate this possibility.
For the triggered failover there is no expectation of data loss. A triggered failover happens when the master is unresponsive, thus no modifications are being made to the master. However, as with each of these scenarios there is the matter of handling the actual TCP reconnect which is mandatory in any failover scenario.
Under either failover scenario, triggered or initiated, the client will have to detect and reconnect – either after a Sentinel lookup or a brief window to the same address. In the case of ObjectRocket Redis which utilizes a managed proxy layer the client will simply reconnect. The actual failover process itself can take up to two seconds to complete. Thus the client code should account for this. Ideally an immediate retry should take place to handle network blips where a connection is simply dropped somewhere along the route. However this should be accompanied with a back algorithm with retries to account for cases such as server restarts (as happens in either standalone Redis restarts or a proxy restart).
Upon reconnect any subscribers will need to re-subscribe as the link between the request channels and the new TCP connection must be established. Since Redis’ PUBSUB channels are created when either a subscriber or publisher attempts to access them, there is no need for “recreating” the channel.
This brings us to how to do this. The answer is highly dependent on the client library in use and how it handles disconnects. In an ideal scenario the connection would be configurable for automatic retry with an ultimate failure returned if boundaries are met. So far I’ve tested a couple libraries for how they handle this and it varies dramatically. Here I’ll discuss a rather common one: redis-py.
The first bit of good news is that redis-py does appear to retry when the connection drops. Unfortunately it immediately retries, and thus is too quick to reliably recover the connection during a failover. Further there appears to be no way to configure this. As a result your code must catch/detect the failed reconnect attempt and manage the reconnect yourself.
First let us examine some standard redis-py publish and subscriber code blocks:
This is fairly straightforward. However, when the immediate retry during the subscriber’s for-loop fails you will get a redis.ConnectionError exception thrown. The tricky bit is it happens “inside” the for item in p.listen() line. Thus to properly catch it you must wrap the entire for statement in a try/except block. This is problematic at best and leads to unnecessary code complexity.
An alternative route is to do the following instead.
With this method we call get_message() directly which allows us to catch the exception at that point and re-establish the ‘p’ object’s connection. The amount of time to sleep, if at all, is dependent on your code’s requirements. Of course if your subscriber expects to handle a certain number of messages and a for-loop works better, it will still work. For publishers the code is simpler as it is generally not run on an iterator.
With this method you have control over whether, when, and how often to retry. This is critical to transparently handling the failure event properly. Note that this would be the same thing necessary if the code is using Sentinel directly or even if the Redis server in question restarted. As such really all publishing code should follow this basic mechanism.
For the blocking list commands such as BLPOP, the order of client reconnection isn’t strictly important as the data is persisted. The try/except method described above will be needed to re-establish the connection when the command execution results in a “redis.ConnectionError” exception being thrown.
For redis-py specifically against the ObjectRocket Redis platform using these techniques with a retry window encompassing 3 seconds will ensure that there is no data loss expectation for triggered failovers, and approximately 1.5 seconds of potential message loss for non-stop publishers during an initiated failover such as during a vertical resize of the instance.
While the code examples were specific to redis-py, the basic techniques should be used with any client code which needs to handle server reconnects and makes use of either the PUBSUB commands or blocking list operations. Armed with this knowledge you can now implement these techniques to use a highly available Redis pod and know your application can handle failovers without waking your ops team up in the middle of the night.