Root Cause Analysis: Stop Fixing Symptoms and Start Fixing Problems
by Eric Hanson, Backend Developer at Clean Systems Consulting
The Incident That Recurred
A production service went down because a database migration was run without checking whether it would complete within the deployment timeout. The deploy timed out, the service started with a half-migrated schema, and requests began failing.
The fix: roll back the deploy, complete the migration manually, redeploy. Root cause identified in the post-mortem: "engineer error — migration wasn't tested against production data volume."
Six months later, a different engineer made the same mistake with a different migration. The root cause analysis had identified the symptom — an engineer made a mistake — but not the system property that allowed the mistake to happen repeatedly: there was no mechanism to test migration duration before deploying, and the deployment pipeline had no step that would catch this.
Fixing "engineer error" is not fixing the root cause. Engineers will continue to make this class of error until the system is changed to make the error either impossible or immediately detectable.
What Root Cause Analysis Is Actually Doing
RCA is not a blame exercise and it's not a comprehensive investigation of everything that went wrong. It is a structured inquiry into the system properties that allowed the failure to occur.
The key insight from resilience engineering (Hollnagel, Woods, and others in the field): complex systems don't fail because of single causes. They fail because multiple defenses that should have prevented the failure were either absent, degraded, or misaligned. The useful question is not "who made the mistake" but "what conditions made this mistake possible, and why didn't our systems catch it?"
This framing leads to different and more useful fixes.
The Structure of a Useful Post-Mortem
A useful post-mortem is not a timeline with a paragraph at the bottom listing corrective actions. It is an analysis that answers:
What was the direct trigger? The specific action or failure that immediately caused the incident. (The migration ran during deploy.)
What conditions allowed the trigger to cause an incident? The systemic properties that made the trigger harmful rather than harmless. (No automated check on migration duration. Deploy pipeline doesn't validate migration safety. Monitoring didn't catch the partial failure state before it affected users.)
What early warning signs were present but not acted on? Signals that the incident was approaching that weren't caught. (Staging deployment with a much smaller dataset hadn't surfaced the duration issue. Latency on the database was already elevated before the deploy.)
What would have prevented the incident? Changes to the system, not to individuals. (Automated migration duration estimation in the deploy pipeline. Canary deployment with health check before full rollout. Migration rollback automation.)
What specifically will be changed? Not "be more careful" — specific, named changes with owners and timelines.
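That last point can be enforced mechanically. A minimal sketch (the class and the list of banned phrases are hypothetical, not a standard tool) of representing corrective actions as structured records, so that vague items with no owner or deadline are rejected at the point of entry:

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical record for a post-mortem corrective action.
# An action without a specific change, a named owner, and a due
# date is not an action item; reject it when it is created.
@dataclass
class CorrectiveAction:
    description: str  # a specific system change, not an aspiration
    owner: str        # a named engineer, not a team alias
    due: date

    def __post_init__(self):
        vague = {"be more careful", "improve communication", "better testing"}
        if self.description.strip().lower() in vague:
            raise ValueError("corrective action must name a specific system change")
        if not self.owner.strip():
            raise ValueError("corrective action must have a named owner")

action = CorrectiveAction(
    description="add migration duration estimation to the deploy pipeline",
    owner="ehanson",
    due=date(2025, 3, 1),
)
```

The point of the structure is not the validation logic; it is that "be more careful" cannot survive contact with a template that demands a change, an owner, and a date.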
The Five Whys in RCA
The Five Whys methodology, applied at the system level rather than the individual level, traces the incident back to a missing mechanism:
- Why did the service fail? Requests returned 500 errors during deploy.
- Why did requests fail? Schema was in a partially migrated state.
- Why was the schema partially migrated? The migration ran during deploy and the deploy timed out.
- Why did the deploy time out before the migration completed? The migration took 45 minutes on the production dataset; the deploy timeout is 10 minutes.
- Why wasn't the migration duration known before deploy? No tooling or process to estimate migration duration on production data volume.
The root cause is absent tooling. The fix is tooling, not training.
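The absent tooling does not need to be elaborate. A sketch of a pre-deploy check (the function signature, timeout, and safety margin are illustrative assumptions, not this team's actual pipeline): time the migration against a production-sized snapshot, and fail the pipeline if it would not finish comfortably inside the deploy timeout.

```python
import time

def check_migration_duration(run_migration, snapshot_conn,
                             deploy_timeout_s=600, safety_margin=0.5):
    """Time the migration against a production-sized snapshot and fail
    the pipeline if it would not finish well inside the deploy timeout.
    The 10-minute default and 0.5 margin are hypothetical values."""
    start = time.monotonic()
    run_migration(snapshot_conn)  # runs against the snapshot, never production
    elapsed = time.monotonic() - start
    budget = deploy_timeout_s * safety_margin
    if elapsed > budget:
        raise RuntimeError(
            f"migration took {elapsed:.1f}s on the snapshot; budget is "
            f"{budget:.1f}s. Run it out of band, before the deploy."
        )
    return elapsed
```

With this step in the pipeline, the 45-minute migration from the incident fails loudly in CI instead of silently half-applying in production, and it fails the same way for every engineer who makes the same mistake.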
The Blameless Culture Requirement
RCA only works in a culture where people feel safe providing complete and accurate information about what happened. When post-mortems produce blame, engineers learn to provide incomplete information and to point away from their own actions. The result is post-mortems with sanitized timelines and corrective actions that say "better communication."
Blameless doesn't mean consequence-free. It means the analysis is directed at systems, not at individuals. The question is "what about the system allowed this to happen?" not "why did this person do this?"
The Practical Takeaway
For your team's next production incident, add one question to the post-mortem template: "If a different engineer made the same mistake, would our current systems catch it?" If the answer is no, that's your highest-priority corrective action. The fix that prevents the next person from making the same mistake is worth more than the fix that addresses the specific instance.
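One lightweight way to make the question unavoidable (the template format and function are hypothetical): lint the post-mortem document for the question and refuse to mark the incident closed until it has an answer.

```python
# The recurrence question every post-mortem must answer before closing.
REQUIRED_QUESTION = ("If a different engineer made the same mistake, "
                     "would our current systems catch it?")

def postmortem_is_complete(text: str) -> bool:
    """True only if the post-mortem contains the recurrence question
    followed by at least one non-empty answer line."""
    lines = [line.strip() for line in text.splitlines()]
    for i, line in enumerate(lines):
        if line == REQUIRED_QUESTION:
            return any(rest for rest in lines[i + 1:])
    return False
```

The check is crude on purpose: it cannot judge the quality of the answer, but it guarantees the question gets asked, which is the part teams reliably skip.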