What Is Root Cause Analysis in Software Development?
Root cause analysis (RCA) is about getting to the root cause of an issue. With this method, however, you’re not looking for a single cause; you’re looking for an entire system of causes.
Root cause analysis is an effective management method used across a variety of industries, especially those dependent on technology. The approach is similar to the way a doctor treats a patient. The doctor starts by understanding the symptoms and then orders simple tests (e.g. blood work) to analyze the disease’s root cause. If the root cause is still unknown, the doctor turns to more complex methods and tests, e.g. a CT scan, to understand it further, and keeps working that way until the root cause of the problem is narrowed down.
We use the same logic when we apply this method to other industries. Here, we’re going to give you an overview of root cause analysis involving technology and software errors, including a definition of what it is and how it works.
Defining Root Cause Analysis
As mentioned above, root cause analysis is a systematic approach for identifying the “root” causes of issues and incidents. The operative word is causes, because there’s never just one reason why events take a turn for the worse or why the systems in place fall apart.
Root cause analysis is based on the idea that effective management requires going beyond simply putting out fires. It’s not enough to quickly slap a band-aid on a problem when it arises. You must also figure out how to prevent it from recurring, and to do that, you have to find out why the problem occurred in the first place, i.e., the root cause.
Of course, before you can use the root cause analysis approach, a problem must occur. This may sound like a dog chasing its tail, but bear with us.
One of the most important things to remember when it comes to root cause analysis is that you’re not just trying to identify one main cause. When you focus on a single cause, it limits your solution potential. It also keeps you from fixing the problem as a whole and being able to prevent it from happening again in the future.
General steps for root cause analysis
- Define the problem
- Collect data relevant to the problem
- Ask why. Identify the root cause
- Implement corrective actions
- Prevent the problem from recurring or causing other problems
- Implement and test the solution
To conduct a Root Cause Analysis systematically, we can use the SMART rule to define the problem:
- Specific,
- Measurable,
- Action-Oriented,
- Relevant,
- Time-Bound.
After that, we can use the 5 Whys analysis or Fishbone analysis (also known as a Herringbone or Ishikawa diagram) to find root causes.
Source: https://en.wikipedia.org/wiki/Root_cause_analysis#/media/File:Root_Cause_Analysis_Tree_Diagram.jpg
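A 5 Whys chain can also be written down as plain data next to the incident notes. The Python sketch below is only an illustration: the `WhyChain` class and the example incident are hypothetical, and the deepest answer is treated as a candidate that still has to be confirmed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WhyChain:
    """A problem statement plus the successive answers to "why?"."""

    problem: str
    answers: List[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> "WhyChain":
        # Each recorded answer becomes the subject of the next "why?".
        self.answers.append(answer)
        return self

    def candidate_root_cause(self) -> Optional[str]:
        # The deepest answer so far is only a candidate until it is confirmed.
        return self.answers[-1] if self.answers else None


# Hypothetical incident, used only to show the shape of the chain.
chain = (
    WhyChain("Confirmation e-mails were not sent for 2 hours on Monday")
    .ask_why("The mail worker stopped processing the queue")
    .ask_why("The worker crashed on a message with an empty body")
    .ask_why("The template service returned no content")
    .ask_why("The template was removed in the latest release")
    .ask_why("The release checklist does not verify templates in use")
)

print(chain.candidate_root_cause())
```

Note that the problem statement follows the SMART rule loosely: it names what failed, for how long, and when.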
Getting to the “True Root Cause” in software development
Technologically speaking, when we apply root cause analysis to software errors, we work our way down from the problem definition, through the data and the executing code, to the true root cause. Along the way, we may find that we cannot pinpoint the reason behind the issue.
Define the problem: Deterministic vs Non-deterministic errors
As mentioned before, a problem must occur before we can conduct root cause analysis. That means we need to be able to confirm the issue and provide a way to recreate it. This can be tricky even when we have all the information, because it is hidden in many places: error messages, stack traces, logs, and reports from testers or end users. Software such as RevDeBug can help in this process.
We can divide software problems into two categories: deterministic and non-deterministic. Deterministic means that we can not only confirm that something is not working but also describe when and why it’s not working and give steps to reproduce it. With this in hand, we can move on to understanding the symptoms, finding the roots of the problem, and fixing it for good.
Non-deterministic errors are much more problematic. Errors happen randomly, at least as we see it. We will need much more data and additional techniques to understand them. We can provide a fix for non-deterministic errors, but we can’t be sure that after fixing one line of code it won’t break in the next one or end in a much more severe problem.
Generally speaking, we look for determinism in each problem, even if we can’t spot it in the beginning.
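One rough way to tell the two categories apart is to repeat the reproduction steps and count how often they fail. The sketch below assumes the steps can be wrapped in a Python callable that raises on failure; `classify_failure` and the sample callables are hypothetical names for illustration.

```python
from typing import Callable


def classify_failure(reproduce: Callable[[], None], attempts: int = 20) -> str:
    """Run the reproduction steps several times and classify the outcome."""
    failures = 0
    for _ in range(attempts):
        try:
            reproduce()
        except Exception:
            failures += 1
    if failures == attempts:
        return "deterministic"
    if failures == 0:
        return "not reproduced"
    return "intermittent"


def always_fails() -> None:
    # Stand-in for reproduction steps that hit the bug every time.
    raise ValueError("bad input")


print(classify_failure(always_fails))  # deterministic
```

A result of "not reproduced" does not prove the absence of a bug; it only means these steps, in this environment, did not trigger it.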
Collect the data, establish actionable context: code version, variables, logs, configuration
Context is king in effective root cause analysis. Logs, variables, error messages, and knowing which code version runs in which environment all matter when we turn non-deterministic cases into deterministic ones.
Which problems are the hardest to solve? Those we don’t log or monitor properly. So when errors happen, we add more logs, generating more data to provide better context for our reasoning. It’s a chicken-and-egg problem: we add logs only after an error happens, so the logging in place at the time of the bug isn’t sufficient to provide correct information about it.
When we don’t have data that can provide boundaries for our issue, we need to dig even deeper to find the real root cause. Without the right information, you can’t effectively ask questions and distinguish causes from symptoms.
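As an illustration of capturing that context at the moment of failure, the sketch below bundles the error, stack trace, code version, configuration, and relevant variables into one structured log record. It uses only the Python standard library; the `error_context` helper, the version string, and the configuration values are assumptions made for this example.

```python
import json
import logging
import platform
import traceback


def error_context(exc: BaseException, code_version: str, config: dict, variables: dict) -> dict:
    """Collect the pieces of context named above into one JSON-friendly record."""
    return {
        "error_type": type(exc).__name__,
        "message": str(exc),
        "stack_trace": traceback.format_exception(type(exc), exc, exc.__traceback__),
        "code_version": code_version,
        "python": platform.python_version(),
        "config": config,
        # repr() keeps the record serializable whatever the variable types are.
        "variables": {name: repr(value) for name, value in variables.items()},
    }


logger = logging.getLogger("orders")  # hypothetical logger name

try:
    quantity = 0
    unit_price = 100 / quantity
except ZeroDivisionError as exc:
    record = error_context(
        exc,
        code_version="1.4.2",       # hypothetical release identifier
        config={"env": "staging"},  # hypothetical configuration
        variables={"quantity": quantity},
    )
    logger.error(json.dumps(record))
```

Logging the record as a single JSON line makes it easier to search and group later, at the cost of larger log entries.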
Ask why: Debugging, Testing, Deploying
From the user’s perspective, an error looks simple: the app is not doing what it should. In theory, it also looks simple. You have the error details, logs, variables, and the whole context, so when you read the code with this information, you should be able to understand relatively easily why your app broke.
In reality, one error sends developers on a long journey to find answers to questions like:
- Is this issue limited to one user, or is it a global problem?
- Is this issue connected with a subset of users or with specific data (context)?
- Are there any other similar cases?
- Is this connected with our recent actions (e.g. new software release)?
- …
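If error reports are collected as structured records, several of the questions above reduce to simple counts. The sketch below assumes a hypothetical list of records with `user`, `release`, and `endpoint` fields; real reports will have different fields.

```python
from collections import Counter

# Hypothetical error records; field names and values are assumptions.
errors = [
    {"user": "alice", "release": "2.3.0", "endpoint": "/checkout"},
    {"user": "bob", "release": "2.3.0", "endpoint": "/checkout"},
    {"user": "alice", "release": "2.3.0", "endpoint": "/checkout"},
    {"user": "carol", "release": "2.2.9", "endpoint": "/profile"},
]


def breakdown(records: list, key: str) -> Counter:
    """Count error records per value of one field."""
    return Counter(record[key] for record in records)


by_user = breakdown(errors, "user")        # one user, or many?
by_release = breakdown(errors, "release")  # tied to a recent release?

print(len(by_user))               # 3
print(by_release.most_common(1))  # [('2.3.0', 3)]
```

Counts like these show where errors cluster; they do not by themselves show why.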
The more questions we can answer, the closer we are to determining the real root cause. What happens when we can’t find the answers we are looking for?
We turn to more sophisticated methods, like debugging, to better understand the code that isn’t working as it should. This often involves additional testing, recreating environments, or replication attempts. The problem with this approach is that no book or online resource makes it easier or more predictable. Each system is different, and even with inside knowledge, finding the root cause can be hard or impossible. The process is time-consuming, and frustrating when it doesn’t bring any results.
We work on establishing boundaries that will bring determinism to non-deterministic errors. When we can’t recreate the issue, we test different approaches, wait for a production failure, or iterate until it happens once more so that we can better understand what’s happening.
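One such boundary, for failures that depend on randomness, is to record the random seed so a failing run can be replayed exactly. This is a minimal sketch under that assumption; `run_with_seed` and `shuffle_scenario` are hypothetical, and other sources of non-determinism (timing, concurrency, external services) need different techniques.

```python
import random
import time


def run_with_seed(scenario, seed=None):
    """Run scenario(rng) with a recorded seed; return (seed, exception or None)."""
    if seed is None:
        seed = time.time_ns()
    rng = random.Random(seed)
    try:
        scenario(rng)
        return seed, None
    except Exception as exc:
        return seed, exc


def shuffle_scenario(rng):
    # Stand-in for code with an order-dependent bug.
    items = [1, 2, 3, 4]
    rng.shuffle(items)
    if items[0] == 4:
        raise AssertionError(f"unexpected order: {items}")


# Search a range of seeds for one that triggers the failure...
failing_seed = next(
    seed for seed in range(1000)
    if run_with_seed(shuffle_scenario, seed)[1] is not None
)

# ...then replay it: the same seed produces the same failure every time.
_, first = run_with_seed(shuffle_scenario, failing_seed)
_, second = run_with_seed(shuffle_scenario, failing_seed)
print(str(first) == str(second))  # True
```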
The main obstacle in finding the “true root cause” of software errors is time. When something is not working, we feel pressure to resolve it as soon as possible. When we can’t find answers to our questions, we end up in a difficult situation: we don’t know what to fix, so we guess.
Root cause analysis does not just focus on symptoms
Symptoms should lead you to the causes of the errors, but that’s not always the case. Some errors look simple but are, in reality, hidden in complexity. Take, for example, a distributed system or microservice-based app. The workflow should end by sending an e-mail, but it didn’t.
We could have multiple reasons behind that:
- The code responsible for the sending didn’t work correctly.
- The e-mail service was responsible.
- We didn’t send the message because we didn’t get content.
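To separate these possibilities, each stage of the workflow can report which step failed instead of surfacing a single generic error at the end. The sketch below is an in-process simplification of a distributed workflow; `run_workflow`, `build_content`, and `send_via_service` are hypothetical names, and the sending step only stands in for a call to an external service.

```python
def run_workflow(steps, payload):
    """Run named steps in order; return (result, failed_step_name, exception)."""
    for name, step in steps:
        try:
            payload = step(payload)
        except Exception as exc:
            # Report the step where the failure occurred, not just the symptom.
            return None, name, exc
    return payload, None, None


def build_content(order):
    if not order.get("items"):
        raise ValueError("no content to send")
    return {"to": order["email"], "body": f"{len(order['items'])} item(s) ordered"}


def send_via_service(message):
    # Stand-in for a call to an external e-mail service.
    return {"status": "sent", **message}


steps = [("build_content", build_content), ("send_via_service", send_via_service)]

result, failed_step, error = run_workflow(
    steps, {"email": "user@example.com", "items": []}
)
print(failed_step, error)  # build_content no content to send
```

Here the missing e-mail is the symptom, while the failing step points to missing content upstream of the sending code.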
As you probably know, this is a wild goose chase for a needle in a haystack. With today’s complex, often cloud-based systems, we don’t control many aspects of our software. That means causes are no longer located where we see the symptoms, i.e. the errors.
Without proper monitoring tools and a software development life cycle process in place, determining the real root causes is challenging. You could be lucky in most cases, but the errors that slip through even the strictest QA process will drive you nuts.
Remember the rule from statistics that correlation is not causation. Understanding the difference between causes and symptoms is critical to providing effective fixes.
Identify solutions and corrective actions:
When providing a fix for the identified error, we need to think about two aspects: fixing the issue and preventing it from happening in the future. We have all encountered “hotfixes” that stay forever and generate technical debt.
Without finding the root cause of the error, you can’t be sure that your program won’t break in the next line after fixing one line of code. It also means that you could fix something you think is broken but isn’t, burying the real problem in technical debt.
Consider the situation in which your automated tests start to fail randomly without any apparent changes. When a test fails randomly, in a non-deterministic way, you need to check in both directions: your code and test degradation (do you know what test flakiness is?).
If you only touch the tests, you will remove the failure’s non-deterministic aspect, but does that fix the issue? In our experience, 25% of those randomly failing tests lead to severe, hard-to-fix problems in production.
Fixing the wrong cause adds technical debt to your code. To prevent that, during root cause analysis we focus on both fixing the issue and preventing it in the future. We need to confirm the root causes to resolve issues, which is a challenging part of software development.
Implement & test solution:
Interestingly, when you inspect your past fixes, most of them were short and relatively small. Most of your time was spent figuring out where to put them. Properly conducted root cause analysis should provide you with the answers you are looking for.
The only danger here is that you won’t be able to recreate the situation in a test environment to confirm that your fix works correctly. When you run a smaller system, this probably won’t be an issue in most cases. On the other hand, when you maintain larger, complex environments with many dependent systems and multiple configurations, getting them into the right setup to recreate the conditions needed to test your code can be problematic, to say the least.
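When the failing inputs have been captured, one modest way to confirm a fix is to replay them as a regression test. The sketch below assumes the failure depends only on those inputs; `apply_discount` and the captured values are hypothetical. It does not address the harder case above, where the failure depends on the configuration of many dependent systems.

```python
def apply_discount(total: float, percent: float) -> float:
    """Hypothetical function that received the fix."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(total * (1 - percent / 100), 2)


# Hypothetical inputs recorded from the failing case.
CAPTURED_CASE = {"total": 80.0, "percent": 100.0}


def test_captured_case_stays_fixed() -> None:
    # Replays the recorded inputs against the fixed code.
    assert apply_discount(**CAPTURED_CASE) == 0.0


test_captured_case_stays_fixed()
```

Keeping the test in the suite also guards against the same failure returning in a later release.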
Conclusion
Context is the key to troubleshooting. If you don’t have it, you will spend more time looking for determinism in your issues. True root cause analysis means not only casting a wide net but also going the extra mile, deep into the various facets of the code.
Unfortunately, current error monitoring and APM tools provide only a subset of all the information needed for successful root cause analysis of errors. This results in limited visibility, hindered analysis, and dependency on a “few select groups of individuals” with intrinsic knowledge of the systems.
While the technology we use in software development has evolved rapidly over the last decades, troubleshooting hasn’t changed much. Companies look for better tools to help them find the True Root Cause of application issues within minutes and remove all the guesswork from the process.
RevDeBug is the only tool that provides dev teams with the ability to visualize each failure and provide 100% reproduction for each error. That provides instant True Root Cause analysis for every error and slowdown across the software delivery lifecycle. Visit our website to learn more.