Bug map

In a previous post, I laid some groundwork for debugging, and promised follow-up posts. This promise is long overdue, so let us continue with what I call a “bug map.”

A bug map is a directed graph that shows the chain of causality from the test cases that exposed the bug in the first place to the observable symptoms of the bug. Let’s look at a concrete example from one of my past projects:

Bug map at the end of debugging

This was a nasty bug that took multiple people from multiple companies and countries multiple months to fix. The main reason was that the critical parts of the bug map were buried deep in the CPU architecture, whose internal design we didn’t know, and our tools offered limited help in analyzing what was going on in the CPU. [1]

Of course, that was the map when we finally found the root cause. In the beginning we had this:

Bug map in the initial phases of debugging

So, we had some ideas about what the relevant part of the test case could be, and what the potential causes for the CPU stall could be. We didn’t know what was in the middle, so we had a fog there. Over time, we worked on the right side of the bug map by understanding how the CPU and the OS work, developing further hypotheses, and testing them. We also worked on the left side by developing more efficient reproduction scenarios (shorter tests with a higher probability of reproducing the issue). So the fog cleared up from both sides. Some potential causes turned out to be red herrings. And when we had the complete picture, we could develop a fix that removed the root cause. [2]

Coming back to theory, the underlying idea of the bug map is that the root cause lies somewhere between the test cases and the symptoms, and a map is the ideal tool if you are searching for something, in this case the root cause.
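To make the "somewhere between the test cases and the symptoms" idea concrete, here is a minimal sketch, assuming the bug map is stored as a plain adjacency dictionary. The node names and the helper functions (`reachable`, `reverse`, `search_area`) are made up for illustration and are not the actual map from the project above. The search area is taken to be every node that is both downstream of a test case and upstream of a symptom.

```python
from collections import deque

# Hypothetical bug map: an edge A -> B means "A can cause B".
# All node names are invented for illustration.
BUG_MAP = {
    "stress test": ["frequent remapping", "high bus load"],
    "frequent remapping": ["bad access"],
    "bad access": ["CPU stall"],
    "high bus load": [],            # explored, leads to no symptom
    "overheating": ["CPU stall"],   # suspected, but not triggered by the test
    "CPU stall": [],
}


def reachable(graph, starts):
    """Return every node reachable from `starts` by following edges (BFS)."""
    seen = set(starts)
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(graph):
    """Return a copy of the graph with every edge flipped."""
    flipped = {node: [] for node in graph}
    for node, targets in graph.items():
        for target in targets:
            flipped.setdefault(target, []).append(node)
    return flipped


def search_area(graph, test_cases, symptoms):
    """Nodes that lie on some path from a test case to a symptom."""
    downstream = reachable(graph, test_cases)
    upstream = reachable(reverse(graph), symptoms)
    return downstream & upstream


area = search_area(BUG_MAP, ["stress test"], ["CPU stall"])
print(sorted(area))
```

In this toy map, "high bus load" and "overheating" fall outside the search area, which is one way a red herring might show up once enough links are known.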

This brings several advantages:

  1. The simple visualization renders the complexity of the bug completely transparent. The underlying idea is that software systems are deterministic in nature [3], so there is no room for black magic. ;-) This is beneficial even if you are working alone on the problem and your bug map is on a piece of paper in front of you.
  2. If you are working with multiple people in multiple locations, it gives the whole team a common understanding of the issue, immediately and continuously. It is a consolidating view [4] of the issue for the whole team, so even newcomers can jump on board very quickly.
  3. If you need to consult another team about a specific subsystem’s behavior (e.g. a SW component reused from another team or, as in this case, the CPU), you can take a subgraph of the bug map to them as the problem description. The rest of your system will not be relevant to them, but the suspicious links inside their subsystem will be. The bug map methods can be applied recursively to the subgraph, i.e. implementing repro scenarios, observing symptoms etc.
  4. The bug map allows a clear definition of many terms we use throughout debugging: A repro is what you see on the left side. A symptom is what you see on the right side. The path from root cause to symptoms is the excitation. A red herring is a potential cause that turned out to be irrelevant. The root cause is the node that triggered the wrong behavior. A fix is what removes this node from the map, and a workaround is what keeps the node on the map, but changes the links so that it doesn’t lead to the symptoms anymore.
  5. Lastly, the map plays well with a Warcraft analogy. :-) You have a map with some uncharted areas, and you need to make tactical decisions about which areas to explore next. Some areas that you haven’t explored recently might become covered by the “fog of war” over time, so you may need to re-explore them. I guess this analogy will be useful, as it’s hard to find a SW developer who hasn’t played Warcraft.
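The difference between a fix and a workaround from point 4 can be sketched on a toy graph. This is only an illustration under stated assumptions: the node names and the helpers `has_path`, `apply_fix` and `apply_workaround` are hypothetical, and "changing the links" is modeled in the simplest possible way, as cutting a single edge.

```python
# Hypothetical bug map: an edge A -> B means "A can cause B".
# All node names are invented for illustration.
BUG_MAP = {
    "repro test": ["risky config"],
    "risky config": ["bad access"],
    "bad access": ["CPU stall"],
    "CPU stall": [],
}


def has_path(graph, start, goal):
    """Depth-first search: True if `goal` can be reached from `start`."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return False


def apply_fix(graph, root_cause):
    """A fix removes the root-cause node and every edge touching it."""
    return {
        node: [t for t in targets if t != root_cause]
        for node, targets in graph.items()
        if node != root_cause
    }


def apply_workaround(graph, source, target):
    """A workaround keeps all nodes but cuts one edge (assumed model)."""
    patched = {node: list(targets) for node, targets in graph.items()}
    patched[source].remove(target)
    return patched


fixed = apply_fix(BUG_MAP, "risky config")
worked_around = apply_workaround(BUG_MAP, "risky config", "bad access")

print(has_path(BUG_MAP, "repro test", "CPU stall"))
print(has_path(fixed, "repro test", "CPU stall"))
print(has_path(worked_around, "repro test", "CPU stall"))
```

In both patched maps the repro no longer reaches the symptom. The difference is that "risky config" is gone from `fixed` but still present in `worked_around`, where it could lead to the symptom again if a new link appears.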

As a final note, even when the fog is somewhat cleared, seeing the root cause in the bug map might still not be straightforward, as it will show you just a lot of causes, but will not reveal which one is the root cause. For this, you still need to have a critical look (supported by the knowledge of the architecture) at the chain of causality and identify which link is wrong.

Bug maps help me a lot in my work, so I wanted to share this method with you. Hope it helps in your bugfixing endeavors. In a further post, I’ll share some more ideas on how to clear up the fog. Please let me know if you came across similar methods, or have some ideas on how to improve this.

Good luck in debugging!

Footnotes

1 Concretely, the chain of causality was something like this: Windows CE had a convention of mapping all addresses as cached and uncached, which is convenient for easy configuration, as the 32-bit virtual address space is typically large enough for embedded devices. This caused the CPU to think that some device addresses were also random access addresses like memory addresses, and the modern CPU we had did a lot of speculative access to fill its pipeline efficiently. Speculative access might mean, for instance, that the CPU tries to load addresses out of bounds because it predicts that a for loop will iterate once more over an array, while in fact it won’t. So the CPU tried to access an invalid device address, thereby stalling and rendering even access via the HW debugger impossible. Loading and unloading several DLLs during a regular test run caused many changes in the page tables, stressing the MMU and thereby increasing the probability of the bug occurring.
2 Using the new memory configuration of the OS that doesn’t map all addresses cached and uncached.
3 You cannot implement a true random number generator in SW.
4 Newly found relationships are added, wrong hypotheses are removed, so you’ll only see what’s actually relevant in the bug map.
