Root Cause Analysis: Stop Fixing Symptoms and Start Fixing Problems

by Eric Hanson, Backend Developer at Clean Systems Consulting

The Incident That Recurred

A production service went down because a database migration was run without first checking that it would complete within the deployment timeout. The deploy timed out, the service started with a half-migrated schema, and requests started failing.

The fix: roll back the deploy, complete the migration manually, redeploy. Root cause identified in the post-mortem: "engineer error — migration wasn't tested against production data volume."

Six months later, a different engineer made the same mistake with a different migration. The root cause analysis had identified the symptom — an engineer made a mistake — but not the system property that allowed the mistake to happen repeatedly: there was no mechanism to test migration duration before deploying, and the deployment pipeline had no step that would catch this.

Fixing "engineer error" is not fixing the root cause. Engineers will continue to make this class of error until the system is changed to make the error either impossible or immediately detectable.
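One way to make this class of error immediately detectable, for instance, is a pre-deploy check that refuses to proceed when a pending migration has no recorded duration estimate, or when the estimate exceeds the pipeline timeout. A minimal sketch (the `estimated_duration_seconds` field and the 10-minute timeout are illustrative assumptions, not the original team's setup):

```python
# Hypothetical pre-deploy guard: block the deploy unless every pending
# migration carries a duration estimate that fits inside the deploy timeout.
DEPLOY_TIMEOUT_SECONDS = 600  # assumed 10-minute pipeline timeout

def check_migrations(pending_migrations):
    """Return a list of human-readable problems; an empty list means safe to deploy."""
    problems = []
    for migration in pending_migrations:
        estimate = migration.get("estimated_duration_seconds")
        if estimate is None:
            problems.append(f"{migration['name']}: no duration estimate recorded")
        elif estimate >= DEPLOY_TIMEOUT_SECONDS:
            problems.append(
                f"{migration['name']}: estimated {estimate}s exceeds the "
                f"{DEPLOY_TIMEOUT_SECONDS}s deploy timeout; run it out of band"
            )
    return problems

if __name__ == "__main__":
    pending = [
        {"name": "0042_add_index_on_orders", "estimated_duration_seconds": 2700},
        {"name": "0043_add_column_flags"},  # no estimate recorded
    ]
    for problem in check_migrations(pending):
        print("BLOCKED:", problem)
```

A guard like this doesn't depend on any individual remembering the rule; it turns "be careful with long migrations" into a pipeline failure anyone can see.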

What Root Cause Analysis Is Actually Doing

RCA is not a blame exercise and it's not a comprehensive investigation of everything that went wrong. It is a structured inquiry into the system properties that allowed the failure to occur.

The key insight from resilience engineering (Hollnagel, Woods, and others in the field): complex systems don't fail because of single causes. They fail because multiple defenses that should have prevented the failure were either absent, degraded, or misaligned. The useful question is not "who made the mistake" but "what conditions made this mistake possible, and why didn't our systems catch it?"

This framing leads to different and more useful fixes.

The Structure of a Useful Post-Mortem

A useful post-mortem is not a timeline with a paragraph at the bottom listing corrective actions. It is an analysis that answers:

What was the direct trigger? The specific action or failure that immediately caused the incident. (The migration ran during deploy.)

What conditions allowed the trigger to cause an incident? The systemic properties that made the trigger harmful rather than harmless. (No automated check on migration duration. Deploy pipeline doesn't validate migration safety. Monitoring didn't catch the partial failure state before it affected users.)

What early warning signs were present but not acted on? Signals that the incident was approaching that weren't caught. (Staging deployment with a much smaller dataset hadn't surfaced the duration issue. Latency on the database was already elevated before the deploy.)

What would have prevented the incident? Changes to the system, not to individuals. (Automated migration duration estimation in the deploy pipeline. Canary deployment with health check before full rollout. Migration rollback automation.)

What specifically will be changed? Not "be more careful" — specific, named changes with owners and timelines.
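That last requirement can be enforced mechanically, for example by rejecting corrective actions that are vague or unowned when the post-mortem is filed. A sketch (the field names and vague-phrase list are illustrative, not a standard tool):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class CorrectiveAction:
    description: str  # a system change, not a behavioral resolution
    owner: str        # a named person, not "the team"
    due: date         # a concrete deadline

# Phrases that signal a behavioral fix rather than a system change.
VAGUE_PHRASES = ("be more careful", "better communication", "pay more attention")

def validate(action: CorrectiveAction) -> list:
    """Flag corrective actions that would let a post-mortem close without a real fix."""
    problems = []
    if any(phrase in action.description.lower() for phrase in VAGUE_PHRASES):
        problems.append("description names a behavior, not a system change")
    if not action.owner.strip():
        problems.append("no owner assigned")
    return problems
```

For the migration incident, a passing entry might read: "Add automated migration duration estimation to the deploy pipeline — owner: Dana, due: end of sprint," rather than "be more careful with migrations."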

The Five Whys in RCA

The Five Whys methodology, applied at the system level rather than the individual level, traces the trigger back to the missing defense:

  1. Why did the service fail? Requests returned 500 errors during deploy.
  2. Why did requests fail? Schema was in a partially migrated state.
  3. Why was the schema partially migrated? The migration ran during deploy and the deploy timed out.
  4. Why did the deploy time out before the migration completed? The migration took 45 minutes on the production dataset; the deploy timeout is 10 minutes.
  5. Why wasn't the migration duration known before deploy? No tooling or process to estimate migration duration on production data volume.

The root cause is absent tooling. The fix is tooling, not training.
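What that tooling might look like, in outline: restore a recent production-sized snapshot to a scratch database, run the pending migration against it, and fail the pipeline if the measured duration exceeds the deploy timeout. A sketch assuming a `run_migration` callable (e.g. a wrapper around your framework's migration runner) and the same 10-minute timeout:

```python
import time

DEPLOY_TIMEOUT_SECONDS = 600  # assumed 10-minute deploy timeout

def measure_migration(run_migration, snapshot_dsn):
    """Time the pending migration against a production-sized snapshot database."""
    start = time.monotonic()
    run_migration(snapshot_dsn)  # run the migration against the scratch copy
    return time.monotonic() - start

def assert_migration_fits(run_migration, snapshot_dsn):
    """Fail the pipeline if the measured duration would exceed the deploy timeout."""
    elapsed = measure_migration(run_migration, snapshot_dsn)
    if elapsed >= DEPLOY_TIMEOUT_SECONDS:
        raise SystemExit(
            f"Migration took {elapsed:.0f}s on the snapshot; exceeds the "
            f"{DEPLOY_TIMEOUT_SECONDS}s deploy timeout. Run it out of band."
        )
    return elapsed
```

The measurement is only as good as the snapshot's resemblance to production, so the snapshot should be refreshed regularly; but even a rough measurement would have caught a 45-minute migration headed into a 10-minute window.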

The Blameless Culture Requirement

RCA only works in a culture where people feel safe providing complete and accurate information about what happened. When post-mortems produce blame, engineers learn to provide incomplete information and to point away from their own actions. The result is post-mortems with sanitized timelines and corrective actions that say "better communication."

Blameless doesn't mean consequence-free. It means the analysis is directed at systems, not at individuals. The question is "what about the system allowed this to happen?" not "why did this person do this?"

The Practical Takeaway

For your team's next production incident, add one question to the post-mortem template: "If a different engineer made the same mistake, would our current systems catch it?" If the answer is no, that's your highest-priority corrective action. The fix that prevents the next person from making the same mistake is worth more than the fix that addresses the specific instance.

