What is Root Cause Analysis?
The National Institute of Standards and Technology (NIST) defines root cause analysis as “A principle-based, systems approach for the identification of underlying causes associated with a particular set of risks.”1 In practice, root cause analysis (RCA) is a problem-solving method used to investigate known problems and identify their antecedent and underlying causes. In incident response, organizations use RCA to determine where a breach originated, to address the vulnerability that caused it, and to take corrective and preventive actions so that it does not recur. Root causes can be divided into three types:
- Physical. Physical root causes are failures of tangible components, such as hardware failures, system errors during boot, or tools that do not function.
- Human. Human root causes arise from errors or mistakes, such as when a person lacks the skills to operate systems properly, does not know the tools, introduces a programming error, or tries to perform tasks with the wrong tools.
- Organizational. Organizational root causes arise from administrative issues, such as poor communication, incorrect instructions, insufficient oversight, understaffing, or other administrative shortfalls.
There are six steps to RCA:
- Defining the event. Describe the event accurately and in as much detail as possible. It may help to prepare a list of questions that define the event, for example:
  - What happened?
  - When did it happen?
  - Where did it happen?
  - How did you respond?
  - What systems were involved?
  - Is it contained?
  - What is the impact?
- Finding the causes. Explore all possible causes, looking for as many potential factors as you can. Include people from different roles and departments in this step to get a more complete picture of the causal factors. Brainstorming, process mapping, and tools such as fishbone diagrams can help surface these factors.
- Finding the root cause. Use available tools, such as a security information and event management (SIEM) system or logs, to help identify the root cause. The “5 Whys” technique can also help: ask “Why?” repeatedly whenever a problem is encountered, to get beyond the obvious symptoms and reach the root cause. For example, given the problem statement “The floor is wet”:
  - Why? The overhead pipe is leaking.
  - Why? The water pressure is too high.
  - Why? There is a faulty control valve.
  - Why? Control valves have not been tested.
  - Why? The control valves were not included on the maintenance schedule.
- Finding the solutions. After discovering the root cause, brainstorm possible solutions. As with exploring contributing factors, considering many possible solutions helps ensure that the most prudent one is selected.
- Taking action. Implement the selected solution and follow it through to completion.
- Verifying solution effectiveness. Check whether the solution worked. If you used the 5 Whys, review each answer in the chain to confirm that the solution resolved it:
  - Why? The overhead pipe is leaking.
  - Why? The water pressure is too high.
  - Why? There is a faulty control valve.
  - Why? Control valves have not been tested.
  - Why? The control valves were not included on the maintenance schedule.
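As an illustration only, the 5 Whys chain from the wet-floor example can be recorded as a small data structure. The sketch below is in Python; the `FiveWhys` class and its `ask_why`, `root_cause`, and `unresolved` methods are hypothetical names invented for this example, not part of any RCA tool or standard. It assumes that the last answer in the chain is the candidate root cause.

```python
from dataclasses import dataclass, field


@dataclass
class FiveWhys:
    """Hypothetical record of a problem statement and its chain of "Why?" answers."""

    problem: str
    answers: list[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> None:
        # Each answer explains the previous answer (or the problem itself).
        self.answers.append(answer)

    def root_cause(self) -> str:
        # Assumption: the last answer in the chain is the candidate root cause.
        if not self.answers:
            raise ValueError("no answers recorded yet")
        return self.answers[-1]

    def unresolved(self, fixed: set[str]) -> list[str]:
        # Answers not yet confirmed as fixed, in their original order.
        return [a for a in self.answers if a not in fixed]


analysis = FiveWhys("The floor is wet.")
for answer in [
    "The overhead pipe is leaking.",
    "The water pressure is too high.",
    "There is a faulty control valve.",
    "Control valves have not been tested.",
    "The control valves were not included on the maintenance schedule.",
]:
    analysis.ask_why(answer)

print(analysis.root_cause())
# prints: The control valves were not included on the maintenance schedule.
```

Recording the chain explicitly can make the verification step more concrete: after taking action, any answer still returned by `unresolved` points to a part of the chain that the solution did not address.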
1 NIST, 2023, “Root Cause Analysis”