Rollback Is Not Failure. Not Having One Is.

January 18, 2026

by Arif Ikhsanudin, Backend Developer

The Shame Around Rollback

In many engineering organizations, triggering a rollback is culturally loaded. It means "something went wrong." The implicit expectation is that good deployments don't need rollbacks — the code was tested, the pipeline was green, the engineer should have been more careful. Rolling back feels like public failure.

This framing is actively dangerous. It makes engineers hesitant to roll back when they should, which extends incident duration. It creates pressure to "fix forward" on broken deployments when rollback would be faster and safer. And it incentivizes hiding rollbacks from postmortem documentation, which means the team can't learn from them.

Rollback is not failure. Rollback is a deployment control mechanism — as deliberate and engineered as the forward deployment. A team that rolls back quickly has a shorter mean time to recovery. A team that avoids rollback out of embarrassment has a longer one. Which team would you rather be on?

What a Real Rollback Plan Looks Like

"We can roll back" is not a rollback plan. A rollback plan answers five specific questions:

Who can trigger it? Any on-call engineer, not just the person who deployed. Rollback during an incident should never be blocked by key-person dependency.

What exact command or pipeline step triggers it? Documented, not reconstructed under pressure. Ideally a single command or a button in your deployment UI.

How long does it take? Measured from previous rollbacks, not estimated. A rolling deployment rollback in Kubernetes via kubectl rollout undo is itself a rolling update, so it typically takes about as long as a forward rollout: 5–15 minutes for a typical service. A blue-green rollback only switches traffic back to the previous environment, which is still running, so it is usually under 60 seconds. Know your number.

What are the database implications? This is the hardest question. If the deployment ran a non-reversible migration, rolling back the application code doesn't restore the previous state. The rollback plan must account for this.

How do you verify the rollback succeeded? Specific health checks, specific error rate thresholds, specific user-facing behaviors to validate. Not "it looks better."
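
The five questions above can be captured as a small record, with the verification step expressed as explicit checks rather than a judgment call. This is a minimal sketch, not a definitive implementation: the RollbackPlan fields, the verify_rollback function, the payments-api deployment name, and every threshold are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackPlan:
    """One field per question. Every value used below is illustrative."""
    who_can_trigger: str                 # a role, not a named person
    trigger_command: tuple[str, ...]     # the exact, documented command
    measured_duration_s: float           # taken from previous rollbacks
    database_notes: str                  # what happens to migrations
    expected_version: str                # version that should serve afterwards
    max_error_rate: float                # 0.01 means 1% of requests
    max_p95_latency_ms: float


def verify_rollback(plan: RollbackPlan, observed_version: str,
                    total_requests: int, failed_requests: int,
                    p95_latency_ms: float) -> list[str]:
    """Return the failed checks; an empty list means the rollback verified."""
    problems = []
    if observed_version != plan.expected_version:
        problems.append(
            f"serving {observed_version}, expected {plan.expected_version}")
    if total_requests == 0:
        # No traffic means the error rate cannot be judged either way.
        problems.append("no traffic observed, error rate cannot be judged")
    elif failed_requests / total_requests > plan.max_error_rate:
        rate = failed_requests / total_requests
        problems.append(
            f"error rate {rate:.2%} exceeds {plan.max_error_rate:.2%}")
    if p95_latency_ms > plan.max_p95_latency_ms:
        problems.append(
            f"p95 latency {p95_latency_ms:.0f}ms exceeds "
            f"{plan.max_p95_latency_ms:.0f}ms")
    return problems


# Hypothetical plan for a hypothetical service.
plan = RollbackPlan(
    who_can_trigger="any on-call engineer",
    trigger_command=("kubectl", "rollout", "undo", "deployment/payments-api"),
    measured_duration_s=480.0,
    database_notes="schema is backward-compatible with the previous release",
    expected_version="v1.1",
    max_error_rate=0.01,
    max_p95_latency_ms=300.0,
)
```

Returning the list of failed checks, instead of a single boolean, gives the on-call engineer something concrete to paste into the incident channel.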

The Database Migration Problem in Rollbacks

The most common reason rollbacks fail or are avoided: the deployment ran a database migration that the previous version can't handle.

-- This migration (PostgreSQL syntax) makes the previous version incompatible:
ALTER TABLE payments ALTER COLUMN amount TYPE DECIMAL(19,4);
-- If the previous version expects INTEGER, it can fail with type errors against this schema

The solution is writing migrations to be backward-compatible for at least one release cycle. The expand-contract pattern applies here:

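-- Release 1 (v1.2): Add new column alongside old one (v1.1 writes to old, v1.2 writes to both)
ALTER TABLE payments ADD COLUMN amount_decimal DECIMAL(19,4) NULL;
-- Backfill existing rows before any release reads from the new column:
UPDATE payments SET amount_decimal = amount WHERE amount_decimal IS NULL;

-- Release 2 (v1.3): v1.3 reads from new column, writes to both; rollback to v1.2 still works
-- Release 3 (v1.4): Drop old column (v1.4 uses the new column exclusively;
-- v1.3 still writes to the old column, so rollback to v1.3 or earlier is no longer supported)
ALTER TABLE payments DROP COLUMN amount;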

This means every backward-incompatible schema change takes three releases instead of one. The tradeoff is the ability to roll back releases 1 and 2. For a release cadence of once per week, this adds two weeks of migration horizon. That's a reasonable cost for reliable rollback capability.
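
The expand phase can be rehearsed locally before it reaches a real database. The sketch below uses Python's built-in sqlite3 module as a stand-in for the production database, which is an assumption made for portability; the function names are hypothetical, and integer amounts are used so the old and new columns can be compared directly. It shows that a version which only knows the old column keeps working after the new column is added, which is what makes the rollback safe.

```python
import sqlite3


def expand_schema(conn: sqlite3.Connection) -> None:
    """Release 1 migration: add the new column next to the old one."""
    conn.execute("ALTER TABLE payments ADD COLUMN amount_decimal DECIMAL(19,4)")


def backfill(conn: sqlite3.Connection) -> int:
    """Copy old values into the new column; returns the number of rows touched."""
    cur = conn.execute(
        "UPDATE payments SET amount_decimal = amount "
        "WHERE amount_decimal IS NULL"
    )
    return cur.rowcount


def write_v1_1(conn: sqlite3.Connection, amount: int) -> None:
    """The old version only knows the old column."""
    conn.execute("INSERT INTO payments (amount) VALUES (?)", (amount,))


def write_v1_2(conn: sqlite3.Connection, amount: int) -> None:
    """The new version writes to both columns."""
    conn.execute(
        "INSERT INTO payments (amount, amount_decimal) VALUES (?, ?)",
        (amount, amount),
    )


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payments (id INTEGER PRIMARY KEY, amount INTEGER NOT NULL)"
    )
    write_v1_1(conn, 10)      # row written before the migration
    expand_schema(conn)       # Release 1 migration runs
    write_v1_2(conn, 20)      # new version dual-writes
    write_v1_1(conn, 30)      # simulated rollback: old version still works
    backfilled = backfill(conn)
    old_view = [r[0] for r in conn.execute(
        "SELECT amount FROM payments ORDER BY id")]
    new_view = [r[0] for r in conn.execute(
        "SELECT amount_decimal FROM payments ORDER BY id")]
    conn.close()
    return backfilled, old_view, new_view
```

The row written by the old version after the migration is the one that matters: it lands without a value in the new column, so the backfill has to run again before any release starts reading from amount_decimal.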

Testing Rollback Before You Need It

A rollback plan that's never been tested is a rollback theory. The procedure that sounds straightforward in a calm planning meeting will reveal hidden dependencies, missing permissions, and undocumented state when executed at 2am during an incident.

Schedule rollback drills:

1. Deploy a non-breaking change to staging
2. Verify it's working
3. Execute the rollback procedure
4. Verify the previous version is restored and healthy
5. Measure the time from rollback trigger to healthy state

Do this monthly. Rotate who executes it. The goal is that rollback becomes boring — a routine procedure that any on-call engineer can complete in the expected time without consulting documentation.
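
The drill steps above can be wrapped in a small harness so the measured time is recorded the same way every month. This is a sketch under stated assumptions: run_rollback_drill and its parameters are hypothetical, and the deploy, rollback, and is_healthy callables stand in for whatever your pipeline and health checks actually provide. The clock and sleep parameters are injectable so the harness can be tested without waiting.

```python
import time
from typing import Callable


def run_rollback_drill(
    deploy: Callable[[], None],
    is_healthy: Callable[[], bool],
    rollback: Callable[[], None],
    timeout_s: float = 900.0,
    poll_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Run one drill; return seconds from rollback trigger to healthy state."""
    deploy()                       # step 1: ship a non-breaking change
    if not is_healthy():           # step 2: confirm it works first
        raise RuntimeError("new version is not healthy; drill aborted")

    started = clock()
    rollback()                     # step 3: execute the documented procedure
    # Step 4: is_healthy is assumed to confirm that the previous version
    # is the one serving, not merely that something responds.
    while not is_healthy():
        if clock() - started > timeout_s:
            raise TimeoutError(f"not healthy {timeout_s:.0f}s after rollback")
        sleep(poll_s)
    return clock() - started       # step 5: the number to record
```

Recording the returned value after each drill gives the measured duration that a rollback plan needs, in place of an estimate.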

The Deployment Confidence Loop

Counterintuitively, investing in rollback capability makes teams more willing to deploy, not less. When you know that a bad deployment can be reversed in under 5 minutes by any on-call engineer without database complications, the cost of a bad deployment is bounded. Bounded risk enables more aggressive deployment frequency.

Teams without good rollback tend to be conservative about what they deploy and when — deploying large batches infrequently because each deployment is high-stakes. Teams with good rollback deploy small batches frequently, because each deployment is reversible.

Without rollback capability:
  Deploy risk: HIGH → Deploy frequency: LOW → Batch size: LARGE → Deploy risk: HIGHER

With rollback capability:
  Deploy risk: LOW → Deploy frequency: HIGH → Batch size: SMALL → Deploy risk: LOWER

Build the rollback. Deploy more often. The two are not in tension — they're the same investment.
