Debugging: a detailed look at failures

SW-intensive systems are quite complex. When they are not working as expected, this complexity grows even further. This is where debugging comes in, and a solid methodology is essential for approaching systems of such complexity.

In a series of posts, I’ll try to summarize the debugging methodology I have picked up from trainings, books and of course colleagues. In this first post, I’d like to start with some theory, as I couldn’t find a good summary on the web that directly fits my purposes. Those who get bored with theory ;-) are welcome to jump to the end of this post for the definitions of symptom, root cause and excitation, and continue with the upcoming, more practical posts. So, here goes…

To catch a bug, you have to “think” like one. So, in order to understand the nature of a bug, I’ll extend the scope a little further and include random failures in the discussion.[1]

Random failure
These happen due to uncontrollable conditions, e.g. corruption of data in RAM due to interference. Therefore they are not reproducible. Fault injection is used to create similar failures (a small illustration follows these definitions), but it is not possible to recreate the exact failure that was initially observed. SW never exhibits random failures, but the underlying HW may. Therefore random failures are also commonly called HW failures.
Systematic failure (bug)
These happen due to controllable conditions, e.g. a bug in the code or a fundamental flaw in the system architecture. Therefore they are reproducible, at least in theory; in practice it may still be difficult to identify all the necessary test parameters and apply them again. Systematic failures are mostly specific to SW, so they are commonly called SW failures. However, HW can exhibit them as well (see errata documents for plenty of examples), which is why I choose to use random/systematic instead of HW/SW.
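To make the fault-injection idea above concrete, here is a minimal C sketch (the helper name and approach are my own illustration, not a real fault-injection framework):

    #include <stdint.h>
    #include <stdlib.h>

    /* Hypothetical helper: flips one random bit in a buffer to emulate a
     * single-event upset (bit flip) in RAM. Which bit flips is random,
     * which mirrors why the originally observed random failure can never
     * be recreated exactly, only failures of the same class. */
    static void inject_bit_flip(uint8_t *buf, size_t len)
    {
        size_t byte_idx = (size_t)rand() % len;
        uint8_t mask = (uint8_t)(1u << (rand() % 8));
        buf[byte_idx] ^= mask;
    }

Calling inject_bit_flip() on, say, a received buffer before processing lets you exercise the error-handling paths that a real random failure would trigger.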
Systematic failure: Error -> Defect -> Failure
Random failure: Fault -> Error -> Failure

Figure: Cascading phases of random & systematic failures

Secondly, both systematic and random failures have intrinsic cascading phases, as shown in the figure above.[2] Let’s look at two examples:

Systematic failure: Invalid network packet

  1. Error: The developer misunderstands a detail in the network protocol specification
  2. Defect: The code that fills in the network packet’s fields is wrong
  3. Failure: The device cannot communicate with certain types of servers
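To make this chain tangible, here is a hypothetical C sketch of what such a defect could look like (the struct and field names are made up for illustration; this is not from any real codebase):

    #include <stdint.h>
    #include <arpa/inet.h>  /* htons() */

    /* Hypothetical packet header, invented for this example. */
    struct pkt_header {
        uint16_t dst_port;  /* spec: network byte order (big-endian) */
    };

    static void fill_header(struct pkt_header *hdr, uint16_t port)
    {
        /* Defect caused by the error above: the developer missed that the
         * spec requires network byte order and stores the port as-is. On a
         * little-endian machine the two bytes end up swapped on the wire.
         * The fix would be: hdr->dst_port = htons(port); */
        hdr->dst_port = port;
    }

Note how the error (the misunderstanding) lives in the developer’s head, the defect lives in the code, and the failure only shows up when talking to servers that actually validate this field.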
Random failure: Corrupt Ethernet frame

  1. Fault: An electrical fluctuation on an Ethernet line
  2. Error: A corrupt Ethernet frame due to the fluctuation
  3. Failure: None, because the checksum mechanisms (the Ethernet CRC and the TCP checksum) detected the corruption, the corrupt frame was dropped, and TCP’s retransmission mechanism re-sent the data; the error never propagated into an observable failure.

The characteristic property of the third phase (failure) is that it is externally observable by the user of the system; it is commonly called a symptom. The second phase (defect/error) is characterized by being observable by looking inside the system. The first phase (error/fault) is mostly transient and therefore no longer observable after it has occurred.

Each phase is a necessary but not a sufficient condition for entering the next phase. The preceding phases constitute the root cause(s) of the issue, while the additional conditions required to enter the next phase constitute the excitation.
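As a toy illustration of root cause vs. excitation, consider this hypothetical C fragment (entirely invented for this post): the defect is always present, but the failure needs a specific input to manifest.

    #include <string.h>

    #define BUF_SIZE 16

    /* The defect (an off-by-one in the bounds check) is the root cause of
     * the eventual failure, yet most inputs pass through without any
     * symptom. Only the excitation, an input of exactly BUF_SIZE
     * characters, triggers the failure: the terminating NUL is written
     * one byte past the end of dst. */
    static void copy_input(char dst[BUF_SIZE], const char *src)
    {
        size_t len = strlen(src);
        if (len <= BUF_SIZE) {           /* defect: should be len < BUF_SIZE */
            memcpy(dst, src, len + 1);   /* copies len bytes plus the NUL    */
        }
    }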

In future posts, I will continue with how to organize the debugging work, especially for long-running bugs that involve multiple people. Have a nice time until then. :-)

Footnotes

[1] Although this classification makes the first computer bug not a bug, but a random failure. ;-)
[2] After some discussion and research, I revised some of the terminology here.
