The bug map provides a simple 2D visualization of the bug’s causality chain, thereby exposing the root cause. But this works only once the bug map is fully visible and the fog has cleared. As promised, I’d like to share the best practices I learned from books, articles and, most importantly, colleagues on how to clear the fog.

CC Attribution-Share Alike 3.0 Unported license. (Nybygger at English Wikipedia)
The pillars of your debugging work are:
- Core team: Additional people might come and go, but the core bugfixing team should remain the same, because (a) there will always be some implicit knowledge that you can’t reflect in documents, and (b) you want to keep the teamwork rhythm going. The core team should have very high-bandwidth communication, so try to refrain from multi-site bugfixing teams.
- Backlog: A list of all the actions and shiny ideas that you should try to catch the bug. The backlog will help you track all the ideas that come up, but its main benefit is that it ensures progress via focus.
- Test results sheet: Throughout debugging, especially for hard-to-reproduce bugs, you’re going to execute many tests and unfortunately forget about them within a few hours. Furthermore, you’ll notice that some test parameter you didn’t even care to note becomes relevant once new information about the bug is discovered. In addition, you’ll rarely manage to describe an interesting test result to your teammates accurately and completely from memory. I’ve witnessed meetings where just explaining what exactly was tested consumed more than half of the meeting. The test results sheet addresses these challenges by providing a simple “truth table” of executed tests with the parameters relevant to the debugging effort.
- Bug map: This will help you develop a collective understanding as discussed in the previous post.
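To make the “truth table” idea concrete, here is a minimal sketch of a test results sheet kept as plain rows, plus a helper that counts how often the symptom shows up for each parameter value. The parameter names (`firmware`, `cache`, `load`) and all rows are invented for illustration; a real sheet would use whatever parameters matter for your bug.

```python
from collections import defaultdict

# Hypothetical test results sheet: one row per executed test.
# Parameter names and values are made up for illustration only.
SHEET = [
    {"firmware": "1.2", "cache": "on",  "load": "high", "symptom": True},
    {"firmware": "1.2", "cache": "off", "load": "high", "symptom": False},
    {"firmware": "1.3", "cache": "on",  "load": "low",  "symptom": True},
    {"firmware": "1.3", "cache": "off", "load": "low",  "symptom": False},
    {"firmware": "1.2", "cache": "on",  "load": "low",  "symptom": True},
]


def symptom_counts(sheet, outcome="symptom"):
    """Map each (parameter, value) pair to (runs with symptom, total runs)."""
    counts = defaultdict(lambda: [0, 0])
    for row in sheet:
        for param, value in row.items():
            if param == outcome:
                continue
            counts[(param, value)][1] += 1      # total runs with this value
            if row[outcome]:
                counts[(param, value)][0] += 1  # runs that showed the symptom
    return {key: tuple(val) for key, val in counts.items()}


counts = symptom_counts(SHEET)
for (param, value), (hits, total) in sorted(counts.items()):
    print(f"{param}={value}: symptom in {hits}/{total} runs")
```

In this toy data the symptom appears in every run with `cache=on` and in none with `cache=off`, which is the kind of pattern the sheet is meant to surface. With so few rows this is only a hint for the next test, not a proof.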
On this basis, the bugfixing work begins. I do not dare to list all debugging techniques in this post. Instead, I’d like to recommend Paul Butcher’s very valuable book Debug It!. Among other very useful ideas and techniques, Butcher divides the bugfixing process into four main phases:
- Reproduce: Have a reliable repro, i.e. work on the left-hand side of the bug map.
- Diagnose: Discover the path of causality from the repro to the symptoms, i.e. clear the fog on the bug map, and identify the root cause.
- Fix: Remove the root cause from the bug map.
- Reflect: Think about why the bug occurred in the first place. What was the underlying reason for this bug to occur? Could there be more related bugs? Most importantly, what kind of tests can we introduce to prevent this bug from occurring again?
This division into phases helps to give a general direction to the collective bugfixing effort. Generally, starting on the left side of the bug map is recommended, so that you have a reliable repro first. Here the test results sheet may help you a lot in identifying the relevant test parameters that lead to the symptom. Unfortunately, sometimes a 100% reliable repro is not available and you have to fall back to stochastic testing, i.e. collecting multiple samples with the same test parameters and exploring probabilities.
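One way to explore those probabilities is sketched below. It assumes that the runs are independent and that the bug shows up with a constant probability per run, which is my simplification and will not hold for every system. Under that assumption, the chance of seeing n clean runs in a row while the bug is still present is (1 - p)^n, so you can estimate how many clean runs you need before you start trusting them. The function names are hypothetical.

```python
import math


def failure_rate(failures, runs):
    """Observed fraction of runs that showed the symptom."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    return failures / runs


def clean_runs_needed(p_fail, alpha=0.05):
    """Smallest n such that (1 - p_fail)**n <= alpha.

    Assumes independent runs with a constant failure probability p_fail.
    alpha is the chance of n clean runs happening by luck although the
    bug is still there.
    """
    if not 0 < p_fail < 1:
        raise ValueError("p_fail must be strictly between 0 and 1")
    return math.ceil(math.log(alpha) / math.log(1 - p_fail))


# Example: the symptom showed up in 4 of 20 runs with identical parameters.
p = failure_rate(4, 20)
print(f"observed failure rate: {p:.2f}")
print(f"clean runs needed at alpha=0.05: {clean_runs_needed(p)}")
```

Keep in mind that the observed rate is itself only an estimate from a small sample, so treat the resulting number as a lower bound for planning rather than a guarantee.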
For the diagnosis, there are countless methods. Just to mention two: (1) zoom in on the bug map, i.e. find interesting sub-maps where you can focus on a particular part of your system, since decreasing the scale decreases complexity dramatically[1]; and (2) one of Butcher’s recommendations: build hypothesis bridges through the fog and test your hypothesis. If the test succeeds, you have cleared through the fog in one step[2]. Otherwise, capture your results in the test results sheet, update the bug map, re-prioritize the actions and continue.
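When the repro is only mostly reliable, testing a hypothesis with and without a candidate fix on a few samples raises the question of whether the difference could be luck. A common tool for such small samples is Fisher’s exact test. The sketch below is a one-sided version using only the standard library; it is my illustration, not the procedure from Butcher’s book, and the counts in the example are invented.

```python
from math import comb


def fisher_one_sided(fail_without, n_without, fail_with, n_with):
    """One-sided Fisher exact test p-value.

    Probability of seeing fail_with or fewer failures in the with-fix
    group if the fix actually had no effect, given the total number of
    failures over both groups (hypergeometric distribution).
    """
    total = n_without + n_with
    total_fail = fail_without + fail_with
    favourable = 0
    for k in range(0, fail_with + 1):
        # k failures land in the with-fix group, the rest are passes.
        # math.comb returns 0 for impossible splits, so no bounds check.
        favourable += comb(total_fail, k) * comb(total - total_fail, n_with - k)
    return favourable / comb(total, n_with)


# Invented example: 6 failures in 10 runs without the fix,
# 0 failures in 10 runs with the fix.
p_value = fisher_one_sided(6, 10, 0, 10)
print(f"p-value: {p_value:.4f}")
```

A small p-value means the improvement is unlikely to be a coincidence. Whatever the outcome, the individual runs still belong in the test results sheet.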
Typically when the fix is done, pressure will rise to focus on the next feature or bug, and not on the reflection phase. But without reflecting and increasing your quality you won’t survive long under the attack of new and old bugs.
So much for my best practices. What about yours? Please comment.
May your bugs be transparent and may most of your bugfixing time be spent in the reflection phase.
Footnotes
[1] A classical example of this is narrowing down the bug to the return value of a single function, so you can focus solely on this function.
[2] For example, in the device freeze issue, when we were able to formulate the hypothesis on double mapping of the memory as cached and uncached, we developed a test case where we used the new memory mapping model. Then we used our mostly reliable repro to test with and without the fix on a few samples, thereby confirming our hypothesis.