The Importance of a Bug
I remember the first bug that I shipped to production. I was upset that I’d broken something and was anxious to fix it. But I noticed something curious: the calm demeanor of a senior mentor helping me. They refused to meet my intensity. While the world burned, they wanted to instead discuss the bug and its relative importance.
Junior engineers worrying about code catastrophes seems to be a normal thing. Maybe you are afraid of making a technical mistake and exposing your ignorance. Or shipping a sub-par experience and disappointing your users. I cared about quality before I could consistently deliver it.
In ‘Avoiding Code Catastrophes’, I wrote:
“Step one: don’t panic! Unless your software is powering a shuttle to Mars, the significance of the bug is probably lower than you think.”
Let’s interrogate that. As a frame of reference, we’ll use the US Armed Forces Defense Readiness Condition (DEFCON) system. Here’s a summary:
- DEFCON 1: Maximum military readiness
- DEFCON 2: Ready to deploy and fight in six hours or less
- DEFCON 3: Select forces are ready to deploy in 15 minutes
- DEFCON 4: Above normal readiness
- DEFCON 5: Normal or lowest state of readiness
DEFCON 1 is a true emergency. DEFCON 5 is a low-priority bug we’ll either eventually fix or ignore. The number going up is good.
Let’s say that we know nothing about the bug and we’re starting at DEFCON 1. When you encounter a bug, ask the following questions:
- Who can experience this bug?
- How likely are they to get into the condition to experience this bug?
- If this bug is preventing an action, how important is that action?
- How long has this bug been live?
First: who can experience this bug? Users are typically segmented into roles: you might have admins, employees, logged-in customers, and logged-out customers. If the answer is only employees, you can manage expectations with them directly, so raise that DEFCON. We can fix this bug like any other.
Is the bug visible to a higher-priority user? Let’s continue.
How likely are they to get into the condition to see this bug? Does this happen when a certain action is taken? When does it happen? If you aren’t sure how common it is, look at metrics to estimate. Sometimes the answer is: “the homepage is a white screen for every user.” And many other times, it’s: “this promo code that we sent to 10 people doesn’t work on a Leap Day, which ends in 30 minutes.” Not an emergency.
Still on high alert?
If the bug prevents an action, how important is that action? Is this preventing checkout ($$$), product customization ($$), or applying a promo code ($)? How does this impact the business? If just a little, raise that DEFCON.
Still panicking?
Next: how long has this bug been live? Systems adapt to long-running bugs. It might even be a WONTFIX. Raise that DEFCON.
Few bugs are true emergencies.
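If it helps to see the four questions as a procedure, here is a rough sketch in Python. The `Bug` fields, the role and action labels, and the thresholds (1% of users, 30 days) are all hypothetical choices made for illustration, not rules; pick numbers that fit your own product.

```python
from dataclasses import dataclass


@dataclass
class Bug:
    # All field names and labels here are illustrative assumptions.
    audience: str          # "employees" or "customers"
    affected_share: float  # estimated fraction of that audience who hit it, 0.0 to 1.0
    blocked_action: str    # "checkout", "customization", "promo_code", or "none"
    days_live: int         # how long the bug has been in production


def defcon(bug: Bug) -> int:
    """Start at DEFCON 1 and raise the level once per reassuring answer."""
    level = 1  # we know nothing yet, so assume the worst

    # Who can experience this bug? Employees can be told directly.
    if bug.audience == "employees":
        level += 1

    # How likely are they to hit it? The 1% cutoff is an arbitrary assumption.
    if bug.affected_share < 0.01:
        level += 1

    # How important is the blocked action? Only checkout counts as critical here.
    if bug.blocked_action != "checkout":
        level += 1

    # How long has it been live? The 30-day cutoff is an arbitrary assumption.
    if bug.days_live > 30:
        level += 1

    return level


white_screen = Bug("customers", 1.0, "checkout", 0)
leap_day_promo = Bug("customers", 0.0001, "promo_code", 0)
old_admin_glitch = Bug("employees", 0.001, "none", 90)
```

With these made-up inputs, the white-screen homepage stays at DEFCON 1, while the Leap Day promo code and the old internal glitch both land at calmer levels. The exact numbers matter less than the habit of asking each question before reacting.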
Some might respond: “I’m an operator and my management will tell me if this is a priority.” I disagree. We engineers need to advise our management on the importance of bugs. Often it takes a programmer to answer many of these questions. So while you’re down there, tell us: what’s the damage?
My mentor cared, but was calm. They understood that fixing a bug requires us to understand it. You’ll work faster because you’re calm. Your calmness will help you confirm that it’s permanently, thoroughly fixed. And you’ll make continual progress because you don’t let every alarm distract you.