This publication was originally posted in Spanish on November 4th, 2022.
I am thrilled to share this publication on a topic I find fascinating, both for its usefulness in the software industry and for its complexity, which depends on the context in which it is implemented and on the roles involved.
Root Cause Analysis (RCA) is a process used in Six Sigma that allows a team to understand a problem as fully as possible. In software engineering, this process is commonly applied when defects occur within a software development project, in order to identify their main (or most likely) causes.
The main objectives of Root Cause Analysis are:
- Better define the criticality of the problem by identifying its causes
- Better assess the impact and its implications
- Identify and implement mitigation measures to reduce the risk of (or even avoid) future incidents
RCA is all about getting to the source of issues (variations, problems, or defects) by using tools that can be statistical or non-statistical in nature.
The Process
The RCA process is well documented and described elsewhere, but the general steps are:
- Problem definition
- Identification of the cause or causes through tools such as:
  - Cause and effect diagrams
  - Matrices and relational diagrams
  - Analysis of the five whys
  - Fault Tree Analysis (FTA)
- Preparation of written report
- Redefinition of problems (if necessary)
- Impact analysis
- Criticality assessment
- Categorization of the problem
- Proposals for mitigation measures
For more information about RCA and each of the steps and tools, you can consult specific literature on the subject or on Six Sigma.
Proposal of actions and responsibilities for the RCA in an agile software project
Depending on the environment in which the problem was identified, or how far it seeped into the chain of environments within the software development ecosystem, responsibilities and actions vary between the roles defined for a project. The roles considered for the following proposal are those generally defined for Scrum-type projects: Software Development Engineer (Dev), Software Quality Engineer (QA), Scrum Master (SM) and Product Owner (PO).
Development and test environments
If the problem or defect was identified in the lower environments, either development (local / dev) or testing (qa / testing), it is inferred that it was detected as part of the quality assurance process, so only the following actions and responsibilities are proposed:
- QAs: Responsible for problem logging, due diligence, and written reporting. QAs can also assist in debugging the problem (e.g., preparing test data for problem reproducibility) and must retest the implemented solution.
- Devs: Responsible for determining the real cause and providing all the necessary information for the written report, as well as presenting potential solutions and implementing the one that has been approved.
- SM: Validates that the information is correctly recorded in the project history (e.g., the defect log linked to the written RCA report).
Non-production environments
If the problem or defect was found in environments that are neither development/testing nor production (e.g., staging, UAT, pre-prod), the following actions are proposed for each role in addition to those described above for the dev and qa environments:
- QAs: Find the quality process that failed to detect or stop the problem in the previous environments (dev, qa) so that it does not “escape” again (e.g., by reviewing the testing strategy and quality checkpoints). QAs are also responsible for overseeing the implementation (update or creation) of the process, guide, policy, or quality checkpoints devised to reduce or even avoid similar incidents.
- Devs: Responsible for developing the impact analysis (e.g., number of users affected, type and level of impact) as well as providing technical and functional information in collaboration with business analysts (BAs), POs and end users to determine the level of criticality (if necessary).
- SM: Ensures that ideas proposed as mitigation measures (e.g., improvement processes) derived from the RCA are discussed in the retrospective and coordinates their subsequent implementation. The SM is also responsible for reviewing the effectiveness of the changes resulting from those mitigation measures.
- PO: Is informed of the initial problem or defect, the result of the RCA, and the solution that was applied; helps prioritize the attention given to the problem; and provides feedback to the team on proposed solutions.
Production environments
If the problem or defect was found in production environments, it is assumed that the impact reaches the end users, so the following actions are proposed in addition to those described for the previous environments:
- QAs: Support the identification of functionalities, conditions, and scenarios at risk of the same problem or defect, and include them as suggestions for review in the mitigation measures of the RCA.
- Devs: Lead the collaboration with other roles (such as QA and DevOps) to analyze and reinforce the deployment processes and controls across environments, and determine the solution and the effort required for the functionalities identified as at risk of the same incident.
- SM: Documents the backlog resulting from the identification of scenarios with similar risk potential. Prepares and submits the detailed production incident report to the PO.
- PO: Receives and signs off on the detailed production incident report and collaborates with the team in mitigation proposal sessions. The PO is also responsible for monitoring proposed actions that are beyond the reach of the team.
It should be noted that the actions and responsibilities proposed for each role accumulate as the issue is found later in the chain of deployment environments. That is, the actions and responsibilities, as well as the number of roles involved, increase as the problem or defect occurs later in the development cycle or in the environments closest to production. This in turn makes the process and its evidence more complex and specific, requiring greater and closer collaboration as the number of roles involved increases. Note, however, that all team members (all roles) are responsible for proposing process improvements or mitigation measures for each incident, both from the perspective of their own role and from the perspective of shared processes and tools, as well as the overall dynamics of the team.
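The accumulation described above can be sketched as a small lookup. This is only an illustration: the tier names, the function name, and the one-line action labels are assumptions that summarize the role lists above, not part of any standard.

```python
# Environment tiers, ordered from farthest to closest to production.
# The tier names are hypothetical labels for the three groups described above.
TIERS = ["development_and_test", "non_production", "production"]

# Actions that each tier ADDS per role (short summaries of the lists above).
ADDED_ACTIONS = {
    "development_and_test": {
        "QA": ["log the problem, write the report, retest the fix"],
        "Dev": ["determine the cause, propose and implement the solution"],
        "SM": ["validate the records in the project history"],
    },
    "non_production": {
        "QA": ["fix the quality process that let the defect escape"],
        "Dev": ["develop the impact analysis"],
        "SM": ["take mitigation measures to the retrospective"],
        "PO": ["stay informed, help prioritize, give feedback"],
    },
    "production": {
        "QA": ["identify scenarios at risk of the same defect"],
        "Dev": ["reinforce deployment processes and controls"],
        "SM": ["document the backlog, submit the incident report"],
        "PO": ["sign off the incident report, monitor external actions"],
    },
}


def responsibilities(found_in: str) -> dict[str, list[str]]:
    """Return all actions per role for a defect found in the given tier.

    Actions accumulate: a defect found in a later tier inherits every
    action from the earlier tiers as well.
    """
    if found_in not in TIERS:
        raise ValueError(f"unknown environment tier: {found_in!r}")
    result: dict[str, list[str]] = {}
    for tier in TIERS[: TIERS.index(found_in) + 1]:
        for role, actions in ADDED_ACTIONS[tier].items():
            result.setdefault(role, []).extend(actions)
    return result


if __name__ == "__main__":
    for tier in TIERS:
        per_role = responsibilities(tier)
        total = sum(len(actions) for actions in per_role.values())
        print(f"{tier}: {len(per_role)} roles, {total} action groups")
```

Keeping only the additions per tier and accumulating them at lookup time mirrors how the proposal is written: each section lists what is new, and everything earlier still applies.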
Additionally, it is suggested that the proposed actions be carried out within the Sprint in which the problem or defect is detected; however, when the problem or defect is classified as low priority, all actions (except for the registration of the defect in the log) can be postponed, with the consequent accumulation of technical debt for the team.
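The scheduling rule in the previous paragraph can be expressed as a short function. This is a minimal sketch: the priority value "low", the action label "log defect", and the function name are hypothetical, chosen only for illustration.

```python
# Hypothetical label for the one action that can never be postponed.
LOGGING_ACTION = "log defect"


def split_sprint_actions(
    priority: str, actions: list[str]
) -> tuple[list[str], list[str]]:
    """Return (actions for the current Sprint, actions that may be postponed).

    Only low-priority defects allow postponement, and even then the
    defect must still be registered in the log during the current Sprint.
    """
    if priority != "low":
        return list(actions), []
    current = [a for a in actions if a == LOGGING_ACTION]
    postponed = [a for a in actions if a != LOGGING_ACTION]
    return current, postponed


if __name__ == "__main__":
    todo = ["log defect", "write RCA report", "retest fix"]
    print(split_sprint_actions("high", todo))
    print(split_sprint_actions("low", todo))
```

Anything returned in the second list is, in the terms used above, technical debt the team has chosen to carry.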
What deliverables correspond to this proposal?
The RCA process was designed to provide “actionable” information, i.e., information on which consequential decisions and actions can be based. This requires the development of the following documents and deliverables during the process:
- RCA Report: A formal document that describes the steps followed during the analysis, as well as the results of each of them. This document should include the dates (or periods) spanning the injection of the problem into the environment, its identification, and its resolution; the classification of the criticality of the problem or defect and its category (see publication on categorization of software defects); and the names of the people involved in the analysis, the solution, and its approval.
- Updated problem/defect definition: In many cases, the RCA process leads to a clearer definition of the problem, so its description must be updated (for example, in the defect log). For instance, a user reports the problem “There’s no internet”, and the analysis determines that there is network connectivity between the user’s computer and the Internet, but that the DNS service (one of the main components that enables, among other things, web browsing) is not working. The problem is then redefined as “DNS System Malfunction” and the causes are determined.
- Cause / Diagnosis: This section describes in detail the probable or confirmed causes determined by the analysis, as well as the process used to reach them and all the supporting technical or documentary information.
- Proposals for mitigating the occurrence of new instances: This document describes the measures proposed to reduce or eliminate the possibility that the problem or defect will recur or go undetected, including the details of their implementation, planning, and estimated effort, as well as those responsible for carrying them out and supervising them.
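As one way to picture these deliverables together, the sketch below models them as a single record. The class name, field names, and every value in the example (dates, criticality, category, cause wording, mitigation) are placeholders invented for illustration; only the redefined problem title comes from the DNS example above.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class RcaReport:
    """Hypothetical record combining the deliverables listed above."""

    problem_definition: str   # updated definition, after the analysis
    injected_on: date         # when the problem entered the environment
    identified_on: date       # when it was detected
    resolved_on: date         # when the fix was confirmed
    criticality: str
    category: str
    causes: list[str]
    mitigations: list[str]
    analysts: list[str] = field(default_factory=list)
    solvers: list[str] = field(default_factory=list)
    approvers: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The three dates must follow the order injection, detection, fix.
        if not (self.injected_on <= self.identified_on <= self.resolved_on):
            raise ValueError("dates must be in order: injected, identified, resolved")

    def days_undetected(self) -> int:
        """Days between injection of the problem and its identification."""
        return (self.identified_on - self.injected_on).days

    def days_to_resolve(self) -> int:
        """Days between identification and resolution."""
        return (self.resolved_on - self.identified_on).days


# Placeholder values, loosely following the DNS example above.
report = RcaReport(
    problem_definition="DNS System Malfunction",
    injected_on=date(2022, 10, 3),
    identified_on=date(2022, 10, 10),
    resolved_on=date(2022, 10, 12),
    criticality="high",
    category="infrastructure",
    causes=["DNS service is not working"],
    mitigations=["monitor the DNS service"],
)

if __name__ == "__main__":
    print(report.days_undetected(), report.days_to_resolve())  # 7 2
```

Storing the three dates explicitly makes the two periods the report asks for (time undetected and time to resolve) a simple subtraction rather than something reconstructed later from memory.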
Examples of general mitigation measures
Listed below are some examples of general mitigation measures that can serve as a starting point for an RCA process.
- Determination of technical, functional, or business process training needs for team members
- Update of quality assurance deliverables (e.g., test scenarios, test plan and/or strategy, automations, test data)
- Establishment of quality control points in the development and debugging process (e.g., code reviews, greater unit test coverage, use of automatic static code analysis tools, or unattended execution of automated tests per deployment and per environment, or “nightly builds”)
- Correct assignment of tasks according to the role and experience of team members
- Establishment of policies, guidelines, and processes, as well as their documentation, knowledge, and accessibility for team members
- Documentation and management of team knowledge, as well as cross-training
- Scheduling of events that affect the capacity of the team (e.g., vacations of the most experienced members, code freezes, or other teams’ deployments)
- Emphasis on identifying and communicating technical interdependencies, or interdependencies with external teams, during sprint planning
- Logging and tracking effort progress by role during the Sprint (e.g., to avoid causing a bottleneck by delaying code deployment or reducing estimated times for environment preparation, data, and test execution)
Final Comments
The RCA process can be a valuable tool to determine the causes of, and propose solutions to, problems inherent in software development, regardless of their cause and the context in which they arise. However, although there are many success stories in various industries, the actions to take and the type and level of involvement of each role are often not clearly defined within the context of a software development team.
I have also encountered those who refuse to use RCA in agile teams, arguing that “formal processes take away the agility from the team” or that “agile prioritizes functional software and documentation takes away valuable time”. In these cases, experience has shown that agile teams are perfectly capable of adopting RCAs as part of their dynamics and that the advantages outweigh the arguments against them (especially when the project is already having difficulties due to lack of quality).
It should be noted that this publication emphasizes the actions and responsibilities proposed for the roles involved, as well as the usefulness of deliverable evidence, but does not delve into the RCA process itself. This is intended to keep the text short and focused on the proposal, as the RCA process is very well documented in many and varied sources.
I hope this information is helpful in starting conversations about how to implement RCA on your own teams.