The Art Of The Debug

· 1656 words · 8 minute read

I hate to admit it, but throughout my career I’ve fixed thousands of bugs and created more than I would have liked. I’m not going to say I’ve seen them all because there is always a new one lurking around the corner, but at this stage of the game it is hard to surprise me.

Debugging is a unique activity; as a wise person on the internet once put it, it is something that is usually anticipated with distaste, performed with reluctance, and bragged about forever. Simply put, debugging is nothing more than the act of troubleshooting something. As such, its main objective is to isolate and identify the root causes of a system failure in order to assess any potential damage and come up with solutions that mitigate or, ideally, stop the malfunction. So, for the brave souls out there about to embark on a debugging adventure, here is a collection of useful debugging principles that I’ve learned (the hard way) over the years.

Do Not Panic 🔗

Panicking is not good; it impairs our ability to make thoughtful decisions. Understand that sometimes things just go wrong, and that’s OK: it’s simply part of the job. Usually these sorts of events strengthen the system in the long run and are good at pinpointing areas of the system that need some love and attention. So before you start debugging, make sure you are calm, take some deep breaths, and keep in mind that in the long run these events tend to be positives.

Know What You Know And Don’t Know 🔗

Usually when debugging there will be useful information that can be leveraged to identify the root causes of a system failure. This information can come in the form of logs, spiking graphs, support tickets, etc. All these sources will have bits of information that can be used to identify what we know, what we do not know, and what we need to know. This step is important because the identified knowns and unknowns will provide a good starting point for the investigation and the direction in which efforts need to be focused.

It is critically important not to make assumptions that are not well informed. Doing so can lead to wrong conclusions and, consequently, to wasted debugging time. As a heuristic, if the area of the problem you are trying to solve is not evident within 10 minutes, it is a good time to put your ego aside and get some help.

Use Binary Search To Isolate Root Causes 🔗

Debugging is all about isolation. The more you reduce the surface area of the system (a codebase, infrastructure components, etc.) where the issue could be, the easier it is to find the root cause of a failure.

Using a binary search-like approach to isolate parts of the system tends to be a useful technique. It consists of iteratively searching for root causes in different parts of the system; as each part is searched and confirmed not to contain the root cause, it is eliminated as a candidate area. Do this enough times and the failure’s surface area shrinks until your odds of finding the problem improve significantly.
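As a rough sketch of the idea in Python: given an ordered list of candidate revisions (commits, deploys, config changes) and a predicate that tells you whether the failure is present, you can bisect your way to the change that introduced it. The `revisions` list, the `is_broken` predicate, and the `deploy_and_probe` helper in the usage comment are placeholders you would supply; `git bisect` automates exactly this for commits.

from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

def find_first_bad(revisions: Sequence[T], is_broken: Callable[[T], bool]) -> T:
    """Binary search for the first revision where the failure appears.

    Assumes revisions are ordered oldest first, the oldest one is good,
    the newest one is bad, and the failure, once introduced, persists.
    """
    lo, hi = 0, len(revisions) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_broken(revisions[mid]):
            hi = mid  # failure already present: the culprit is at or before mid
        else:
            lo = mid  # still healthy here: the culprit came later
    return revisions[hi]

# Hypothetical usage: find_first_bad(commit_shas, lambda sha: deploy_and_probe(sha))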

Understand System Cause-Effect Relationships 🔗

When dealing with systems (especially distributed ones), it is important to understand how system components impact one another and what cause-effect relationships exist between them. To understand this point a bit better, consider the following chain of observations:

  • Message queues are backed up
  • All processes in all workers are consuming messages from the queue (i.e. processing jobs)
  • Job executions are taking a long time
  • Database queries are slow
  • The database table grew in size and queries are not hitting indices

From the previous example we can see that the system problem was signaled by backed-up queues, but the actual root cause was a missing index or perhaps a poorly written query. This shows how being mindful of the different system components and their interactions can be extremely valuable when troubleshooting issues. As such, study the different components of the systems you work with and, more importantly, the cause-effect relationships that exist between them.
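To make the last link in that chain concrete, here is a small sketch of how a query plan exposes whether a query is hitting an index. It uses Python’s built-in sqlite3 purely for illustration (the `jobs` table and the index name are made up, and your production database will have its own EXPLAIN syntax), but the habit of checking the plan before and after adding an index is the same everywhere.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT, queued_at TEXT)"
)

query = "SELECT * FROM jobs WHERE status = 'pending'"

# Without an index on the filtered column, the planner falls back to a full scan.
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)  # detail reads something like 'SCAN jobs' (slow once the table grows)

# Add the index the query needs and check the plan again.
conn.execute("CREATE INDEX idx_jobs_status ON jobs (status)")
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)  # now something like 'SEARCH jobs USING INDEX idx_jobs_status (status=?)'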

Understand Race Conditions, Locking & Concurrency 🔗

A lot of times, different processes try to access the same system resource (e.g. a record in a database) at the same time, each hoping to mutate its state. When debugging, it is always useful to keep this possibility in mind and consider race conditions as potential causes of a system failure. These types of bugs are rare, so if they are not applicable to the system you are debugging, simply disregard this point. Otherwise, answering the following questions might help:

  • What different processes are trying to access the same resource at the same time?
  • Should these processes be running asynchronously?
  • Should there be a lock on the resource(s) being accessed?
  • Is there a lock?
  • If there is a lock, is it locking properly?

Thread safety bugs are some of the hardest to troubleshoot, mostly because they are tricky to reproduce and happen very infrequently. With enough luck, answering the five questions above will bring some clarity to your situation, or at least point you in the right direction. If not, go back to the whiteboard and try again.
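To make those questions concrete, here is a minimal, self-contained Python sketch of the kind of bug they are probing: two threads performing an unguarded read-modify-write on shared state lose updates, while wrapping the same critical section in a lock makes the result deterministic. The `time.sleep(0)` is only there to make the bad interleaving easy to reproduce.

import threading
import time

counter = 0
lock = threading.Lock()

def unsafe_worker(iterations: int) -> None:
    """Read-modify-write with no lock: another thread can sneak in between
    the read and the write, and one of the two updates is silently lost."""
    global counter
    for _ in range(iterations):
        current = counter
        time.sleep(0)              # yield, simulating work between read and write
        counter = current + 1

def safe_worker(iterations: int) -> None:
    """Same logic, but the whole read-modify-write is guarded by a lock."""
    global counter
    for _ in range(iterations):
        with lock:
            current = counter
            time.sleep(0)
            counter = current + 1

def run(worker, iterations: int = 5_000) -> int:
    global counter
    counter = 0
    threads = [threading.Thread(target=worker, args=(iterations,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

print(run(unsafe_worker))  # usually well below 10000: updates were lost
print(run(safe_worker))    # always 10000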

Keep Records Of All Your Findings 🔗

This is a very important point that is easily overlooked, mostly because during a crisis one is usually focused on actually solving the crisis. However, you should always keep track of every documentable or loggable action you took to identify the root cause of a system failure or to resolve it. Such record keeping will prove invaluable during postmortems or other learning activities that help prevent the same failure from happening again. Some examples of documentable actions I like to keep track of are:

  • Queries run against different data stores (e.g. MySQL, Redis, Dynamo, etc.)
  • Commands executed (e.g. scaled Kubernetes pods, flushed a queue, restarted a database, etc.)
  • Logging dashboards (e.g. Kibana)
  • Exception monitoring dashboards (e.g. Sentry or Bugsnag)
  • Instrumentation graphs (e.g. statsd or DataDog)

One hack that I like is to share all the troubleshooting records in Slack or whatever team messaging tool you use. This way you kill two birds with one stone: you communicate everything to the team in a transparent fashion, and you keep tabs on every relevant action you took during the troubleshooting process. Just be careful not to share any sensitive data while broadcasting your debugging records; security should always remain priority number one.

Communicate, Communicate & Communicate 🔗

Communication is one of those things we think we do really well when, as a matter of fact, most of us suck at it. Most problems in any social dynamic come down to poor communication: often what we or others mean is interpreted incorrectly, or we assume that someone knows something they don’t (and vice versa). Couple this with hierarchies, egos, fear, and things going wrong, and communication can get really messy, which only makes things worse.

The following are some simple communication rules that I find useful when putting out a fire or troubleshooting something:

  • Always err on the side of over-communication, even to the extent of being annoying or looking dumb. It is better to be safe than sorry
  • Don’t take things personally. During difficult situations where uncertainty is high, it is natural for people not to communicate in the most polite way possible. The reality is that these situations are stressful and being nice is just not a priority. If someone disrespects you or you disrespect someone, simply have a chat after things have cooled down and make amends. In the meantime focus your efforts on solving the problem
  • Communicate in a relaxed and easy-going way that encourages those around you to relax and stay calm. I learned this tip from an interview with the legendary mixed martial arts coach Greg Jackson. In it, Greg pointed out that fighting is an extremely intense and stressful activity where it pays to remain rational and clear minded. As such, the only thing he tells fighters in his corner is to stay calm, breathe, and relax, and he does so with a soothing voice. I think this translates nicely to any stressful situation where being clear minded matters, just like debugging during a defcon, for example. People like to copy what others around them do, so if you stay calm and play things cool, you will help your team stay calm and play things cool, which in turn makes debugging easier
  • Try to be clear and explain why things happen. Prioritize clarity and rationale when explaining something, even if it takes you a bit longer. Speed matters, but being misunderstood is often more costly and slower
  • Trusting yourself and others is good, but not trusting anyone is better. Always verify and validate what you or others say, and be wary of claims that are not backed by evidence. To paraphrase Mark Twain: “it is not what you do not know that kills you, but what you think you know for sure that just ain’t so”

Final Thoughts 🔗

In reality, there is no perfect strategy or set of principles for debugging. Bugs come in all shapes, colors, and sizes. The impact of a simple bug can be massive and the impact of a complex one can be minimal. Just as with everything else, you get better at it with practice and by just being in the arena. At the end of the day all we can do is type things into a keyboard until things start to work as one would expect.