Planning For Failure


I’ve made a fair number of mistakes in my life, some of them small and some of them big. For instance, in my second week at Stitch Labs I was tasked with caching in-app notifications. This made a lot of sense: in-app notifications are updated only once after they are created and are accessed very frequently (essentially every time the app loads), so caching them would spare MySQL a good number of valuable queries. In just a few hours I had everything working. The code was reviewed, approved, and ready to be deployed. However, it was late in the day, so I decided it was best to push the code out the following morning, which I unfortunately did.

Everything looked good post-deploy. The app load time was slightly faster, Redis memory utilization went up (as expected) but seemed fine, and we could see substantially lower QPS (queries per second) against the MySQL nodes. All of this pointed to a job well done, so I decided to go for an early lunch. I was about to take the third bite of a Philly cheesesteak when my phone started buzzing with frantic Slack messages. We had a defcon situation, which meant that the app was down for every single customer. Since Stitch Labs was what’s referred to as a “business-critical” app, thousands of businesses were suddenly unable to fulfill orders or manage their inventory across platforms. Needless to say, support request volume went through the roof, sales demos flopped, and engineers had to stop whatever they were doing to put out the fire. Thankfully, we identified that our Redis instances had run out of memory, and flushing them fixed the problem. This also meant that the outage was my fault: I was not properly invalidating the cached notifications, so stale entries accumulated until Redis ran out of memory. It was a costly mistake from which I re-learned a very valuable lesson: pessimism and planning for failure pay off.
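For what it’s worth, the fix was conceptually simple. Here is a minimal sketch (in Python with the redis-py client; the key scheme and function names are hypothetical, not the actual Stitch Labs code) of caching notifications defensively, with a TTL as a safety net and explicit invalidation on every update:

```python
import json

import redis  # assumes the redis-py client; connection details are illustrative

r = redis.Redis(host="localhost", port=6379)
CACHE_TTL = 60 * 60  # one hour: even a missed invalidation eventually expires


def cache_key(user_id):
    return f"notifications:{user_id}"  # hypothetical key scheme


def get_notifications(user_id, load_from_mysql):
    """Return cached notifications, falling back to MySQL on a miss."""
    cached = r.get(cache_key(user_id))
    if cached is not None:
        return json.loads(cached)
    notifications = load_from_mysql(user_id)
    # setex stores the value with a TTL, so stale keys cannot pile up forever
    r.setex(cache_key(user_id), CACHE_TTL, json.dumps(notifications))
    return notifications


def invalidate_notifications(user_id):
    """Call this whenever a user's notifications change."""
    r.delete(cache_key(user_id))
```

Pairing the TTL with a sane Redis eviction policy (e.g. `allkeys-lru`) would likely have turned the same bug into a cache miss instead of an outage.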

One of the advantages of having worked with distributed systems (a.k.a. the cloud) for most of the past decade is that I’ve gotten to do a lot of planning for failure. Depending on the problem being solved, a distributed system will be required to be either available or consistent for different types of operations. This means that in the event of failure, the system is expected to either tolerate it as if nothing happened (i.e. remain available) or handle it gracefully without corrupting data (i.e. remain consistent). No matter what, the hard reality is that things will fail. And when you are processing ten billion operations a day, even a 0.000001% probability of something bad happening means that something bad will happen about 100 times a day. Naturally, these working conditions force you to constantly and meticulously think about how things might fail and how such failures could be prevented or handled gracefully. What I’m trying to say is that when you work with systems (of any type), failure will eventually happen, and the best way to deal with it is to be ready for it.
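To make that concrete, here is the back-of-the-envelope arithmetic (the daily volume is an illustrative figure, not a measurement from any particular system):

```python
# Expected number of daily failures at a given per-operation failure probability.
operations_per_day = 10_000_000_000   # ten billion, an illustrative volume
failure_probability = 0.000001 / 100  # 0.000001% expressed as a fraction (1e-8)

expected_failures = operations_per_day * failure_probability
print(expected_failures)  # 100.0 -- "never happens" becomes a hundred times a day
```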

Contrary to popular belief, and quite ironically, planning for failure is the best way to achieve a positive outcome. In other words, being a pessimist who dares is usually better than being an optimist who hopes. The reason lies in the concept of asymmetry. If you are familiar with @nntaleb’s work or have studied fat-tailed distributions you probably already know this, but if you aren’t, I’ll try to explain. In essence, asymmetry refers to the disparity in likelihood between outcomes and the size of the potential payoff relative to that disparity. Here is an example: imagine I offered you the chance to win a million dollars by betting a hundred dollars on something where you have 1-in-10,000 odds of winning. Naturally, if you accepted the bet, you would lose most of the time, unlike, say, a symmetric coin toss where you would win 50% of the time. However, a single win would make up for all of your losses, and potentially more. Simply put, what matters is not how likely an event is to happen, but how consequential it is when it happens.
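To put numbers on it, here is the same bet worked out (the figures are taken straight from the example above):

```python
# The asymmetric bet: a $100 stake, 1-in-10,000 odds, a $1,000,000 payoff.
stake = 100
odds = 1 / 10_000
payoff = 1_000_000

bets_per_expected_win = round(1 / odds)        # ~10,000 bets for one expected win
total_staked = bets_per_expected_win * stake   # $1,000,000 lost along the way
print(payoff >= total_staked)                  # True: a single win covers every loss

per_bet_expected_value = odds * payoff - (1 - odds) * stake
print(round(per_bet_expected_value, 2))        # ~$0.01 -- the average is unremarkable;
                                               # the shape of the payoff is the point
```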

This does not mean that we should avoid failure or stop taking risks. Instead, we should take more risks, but in areas where we are able to fail fast, often, in isolation, and, most importantly, well. By failing well I mean failing in a way that is not fatal and that drives improvement through lessons learned. In conclusion, it pays off to be a paranoid risk taker. The combination of paranoia and risk taking will help your system adapt and evolve. And just as in nature, in the end it is not the strongest who survives chaos, but the most adaptable.