Rollback Is Not Failure. Not Having One Is.
by Arif Ikhsanudin, Backend Developer
The Shame Around Rollback
In many engineering organizations, triggering a rollback is culturally loaded. It means "something went wrong." The implicit expectation is that good deployments don't need rollbacks — the code was tested, the pipeline was green, the engineer should have been more careful. Rolling back feels like public failure.
This framing is actively dangerous. It makes engineers hesitant to roll back when they should, which extends incident duration. It creates pressure to "fix forward" on broken deployments when rollback would be faster and safer. And it incentivizes hiding rollbacks from postmortem documentation, which means the team can't learn from them.
Rollback is not failure. Rollback is a deployment control mechanism — as deliberate and engineered as the forward deployment. A team that rolls back quickly has a shorter mean time to recovery. A team that avoids rollback out of embarrassment has a longer one. Which team would you rather be on?
What a Real Rollback Plan Looks Like
"We can roll back" is not a rollback plan. A rollback plan answers five specific questions:
Who can trigger it? Any on-call engineer, not just the person who deployed. Rollback during an incident should never be blocked by key-person dependency.
What exact command or pipeline step triggers it? Documented, not reconstructed under pressure. Ideally a single command or a button in your deployment UI.
How long does it take? Measured from previous rollbacks, not estimated. A rollback of a rolling deployment in Kubernetes via kubectl rollout undo is itself a rolling update, so it typically takes about as long as a forward rollout, often 5–15 minutes for a typical service. A blue-green rollback only switches traffic back to the previous environment, which is still running, so it can finish in under 60 seconds. Know your number.
What are the database implications? This is the hardest question. If the deployment ran a non-reversible migration, rolling back the application code does not restore the previous schema or data, and the old code may not run against what is left. The rollback plan must account for this.
How do you verify the rollback succeeded? Specific health checks, specific error rate thresholds, specific user-facing behaviors to validate. Not "it looks better."
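One way to keep those five answers from living only in someone's head is to store them as a small record per service. The sketch below is a minimal illustration, not a prescribed format: the RollbackPlan fields, the rollback_succeeded helper, and the deployment name payments-api are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class RollbackPlan:
    """One record per service; each field answers one of the five questions."""
    who_can_trigger: str          # e.g. "any on-call engineer"
    command: str                  # the exact, documented command or pipeline step
    measured_minutes: float       # taken from the last real rollback or drill
    migration_reversible: bool    # False means a code rollback alone is not enough
    max_error_rate: float         # verification threshold, e.g. 0.01 for 1%
    health_checks: list[str] = field(default_factory=list)


def rollback_succeeded(plan: RollbackPlan, error_rate: float,
                       check_results: dict[str, bool]) -> bool:
    """True only if every listed health check passed and errors are under the threshold."""
    all_checks_passed = all(check_results.get(name, False)
                            for name in plan.health_checks)
    return all_checks_passed and error_rate <= plan.max_error_rate


# Hypothetical plan for an assumed Kubernetes deployment named payments-api.
plan = RollbackPlan(
    who_can_trigger="any on-call engineer",
    command="kubectl rollout undo deployment/payments-api",
    measured_minutes=8.0,
    migration_reversible=True,
    max_error_rate=0.01,
    health_checks=["/healthz", "/ready"],
)
```

A missing health check result counts as a failure here, which matches the point of the fifth question: "it looks better" is not a pass.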
The Database Migration Problem in Rollbacks
The most common reason rollbacks fail or are avoided: the deployment ran a database migration that the previous version can't handle.
-- This migration makes the previous version incompatible:
ALTER TABLE payments ALTER COLUMN amount TYPE DECIMAL(19,4);
-- If the previous version expects INTEGER, its reads and writes can fail with type errors against this schema
The solution is writing migrations to be backward-compatible for at least one release cycle. The expand-contract pattern applies here:
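-- Release 1 (expand): add the new column alongside the old one (v1.1 writes only to old, v1.2 writes to both)
ALTER TABLE payments ADD COLUMN amount_decimal DECIMAL(19,4) NULL;
-- Backfill rows written before v1.2, so the new column is complete before anything reads from it
UPDATE payments SET amount_decimal = amount WHERE amount_decimal IS NULL;
-- Release 2 (migrate): v1.3 reads from new column, still writes to both; rollback to v1.2 still works
-- Release 3 (contract): v1.4 stops using the old column, then the column is dropped; rollback to v1.3 or earlier is no longer supported
ALTER TABLE payments DROP COLUMN amount;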
This means every backward-incompatible schema change takes three releases instead of one. In exchange, releases 1 and 2 can be rolled back safely. At a cadence of one release per week, the migration finishes two weeks later than a single-release change would. That's a reasonable cost for reliable rollback capability.
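The property that matters in the expand step is that old and new application code both keep working against the same schema. The sketch below checks that with an in-memory SQLite database, under some assumptions: old_write and new_write are hypothetical stand-ins for the two application versions, SQLite is used only because it ships with Python (the ALTER COLUMN ... TYPE statement shown earlier is PostgreSQL syntax), and values are copied into the new column unchanged, whereas a real migration may need a unit conversion.

```python
import sqlite3


def old_write(conn: sqlite3.Connection, amount: int) -> None:
    """Stand-in for the old version: it only knows the original column."""
    conn.execute("INSERT INTO payments (amount) VALUES (?)", (amount,))


def new_write(conn: sqlite3.Connection, amount: int) -> None:
    """Stand-in for the new version: it writes to both columns."""
    conn.execute(
        "INSERT INTO payments (amount, amount_decimal) VALUES (?, ?)",
        (amount, amount),
    )


def run_expand_demo() -> list:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, amount INTEGER)")
    old_write(conn, 100)  # a row that exists before the migration

    # Expand: add the new column without touching the old one.
    conn.execute("ALTER TABLE payments ADD COLUMN amount_decimal DECIMAL(19,4)")

    new_write(conn, 250)  # the new version is deployed and dual-writes
    old_write(conn, 75)   # simulated rollback: the old code still works

    # Backfill rows the old code wrote, so the new column is complete.
    conn.execute(
        "UPDATE payments SET amount_decimal = amount WHERE amount_decimal IS NULL"
    )
    rows = conn.execute(
        "SELECT amount, amount_decimal FROM payments ORDER BY id"
    ).fetchall()
    conn.close()
    return rows


rows = run_expand_demo()
```

The write from the old code after the schema change is the rollback case: it succeeds because the expand step only added a nullable column, and the backfill picks up that row afterwards.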
Testing Rollback Before You Need It
A rollback plan that's never been tested is a rollback theory. The procedure that sounds straightforward in a calm planning meeting will reveal hidden dependencies, missing permissions, and undocumented state when executed at 2am during an incident.
Schedule rollback drills:
- Deploy a non-breaking change to staging
- Verify it's working
- Execute the rollback procedure
- Verify the previous version is restored and healthy
- Measure the time from rollback trigger to healthy state
Do this monthly. Rotate who executes it. The goal is that rollback becomes boring — a routine procedure that any on-call engineer can complete in the expected time without consulting documentation.
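The last drill step, measuring trigger-to-healthy time, is easy to skip unless it is automated. A minimal sketch of that measurement follows. The names time_rollback, trigger, and is_healthy are hypothetical: trigger would wrap whatever your documented rollback command is, and is_healthy would wrap your own health and error-rate checks.

```python
import time
from typing import Callable


def time_rollback(trigger: Callable[[], None],
                  is_healthy: Callable[[], bool],
                  timeout_s: float = 900.0,
                  poll_s: float = 5.0) -> float:
    """Run the rollback, poll until healthy, and return the elapsed seconds.

    Raises TimeoutError if the service is still unhealthy after timeout_s.
    """
    start = time.monotonic()
    trigger()
    while not is_healthy():
        if time.monotonic() - start > timeout_s:
            raise TimeoutError(f"rollback not healthy after {timeout_s:.0f}s")
        time.sleep(poll_s)
    return time.monotonic() - start
```

Recording the returned value after every drill gives you the measured number the rollback plan asks for, and a drill that raises TimeoutError is a finding in its own right.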
The Deployment Confidence Loop
Counterintuitively, investing in rollback capability makes teams more willing to deploy, not less. When you know that a bad deployment can be reversed in under 5 minutes by any on-call engineer without database complications, the cost of a bad deployment is bounded. Bounded risk enables more aggressive deployment frequency.
Teams without good rollback tend to be conservative about what they deploy and when — deploying large batches infrequently because each deployment is high-stakes. Teams with good rollback deploy small batches frequently, because each deployment is reversible.
Without rollback capability:
Deploy risk: HIGH → Deploy frequency: LOW → Batch size: LARGE → Deploy risk: HIGHER
With rollback capability:
Deploy risk: LOW → Deploy frequency: HIGH → Batch size: SMALL → Deploy risk: LOWER
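The "bounded cost" argument can be made concrete with a rough back-of-the-envelope model. This is a deliberately simple sketch: the function expected_impact_minutes is hypothetical, it ignores blast radius and detection time, and every number in the usage lines is invented for illustration, not measured.

```python
def expected_impact_minutes(deploys_per_week: int,
                            failure_probability: float,
                            recovery_minutes: float) -> float:
    """Expected user-facing impact per week.

    Number of deploys, times the chance each one is bad,
    times how long a bad one lasts before recovery.
    """
    return deploys_per_week * failure_probability * recovery_minutes


# Illustrative inputs only: one large, risky weekly deploy with slow recovery,
# versus ten small deploys that are each less likely to fail and quick to reverse.
large_batch = expected_impact_minutes(1, 0.20, 120)
small_batch = expected_impact_minutes(10, 0.03, 5)
```

With these made-up inputs the large-batch team expects roughly 24 minutes of impact per week and the small-batch team about 1.5, even though the second team deploys ten times as often. Your own inputs will differ; the point is that recovery time multiplies everything else.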
Build the rollback. Deploy more often. The two are not in tension — they're the same investment.