Blue-Green Deployment: The Strategy That Makes Rollbacks Painless
by Eric Hanson, Backend Developer at Clean Systems Consulting
The Rollback That Took 45 Minutes
The deployment went out at 2pm. Error rates climbed at 2:07. The on-call engineer identified the cause at 2:19 — a regression in the new version. The rollback started at 2:22. It completed at 3:07, because rolling back a rolling deployment means spinning up old instances and waiting for traffic to shift back through the same gradual rollout process.
45 minutes. That's the cost of not having an idle environment ready to take traffic.
Blue-green deployment exists specifically to eliminate this: when the new version (green) is misbehaving, the old version (blue) is still fully operational and rollback is a load balancer switch — typically under 60 seconds.
How Blue-Green Works
The model is straightforward. You maintain two production environments — blue and green — that are structurally identical. At any given time, one is live (serving production traffic) and one is idle (the previous version, kept warm).
Deployment works like this:
- The idle environment (green, assuming blue is live) is updated with the new version
- The new version is validated against production data with production configuration, but without serving live traffic — smoke tests, health checks, sanity checks
- Traffic is switched from blue to green at the load balancer level — a single configuration change
- Blue remains running, serving no traffic, for a defined warm rollback window (typically 24–48 hours)
- If something goes wrong post-switch, rollback is switching the load balancer back to blue — under 60 seconds
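That last step is the whole payoff. As a hedged sketch (assuming you have the listener ARN and the blue target group ARN captured in environment variables; the names here are placeholders, not AWS-mandated ones), the switch-back is a single CLI call:

```shell
#!/bin/sh
# Hypothetical rollback helper: point the listener's default action back at
# the blue target group. LISTENER_ARN and BLUE_TG_ARN are placeholders for
# values from your own environment.
rollback_to_blue() {
  aws elbv2 modify-listener \
    --listener-arn "$LISTENER_ARN" \
    --default-actions "Type=forward,TargetGroupArn=${BLUE_TG_ARN}"
}
```

No instances start, no images pull, no gradual shift: the blue fleet is already warm, so the rollback completes as fast as the load balancer applies the listener change.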
            ┌──────────┐       ┌──────────────────────────────┐
Traffic ───▶│  ALB /   │──────▶│ Blue (v1.2 - currently live) │
            │  NLB     │       └──────────────────────────────┘
            └──────────┘
                               ┌──────────────────────────────┐
                               │ Green (v1.3 - being deployed)│
                               └──────────────────────────────┘
After validation and switch:
            ┌──────────┐       ┌──────────────────────────────┐
Traffic ───▶│  ALB /   │──────▶│ Green (v1.3 - now live)      │
            │  NLB     │       └──────────────────────────────┘
            └──────────┘
                               ┌──────────────────────────────┐
                               │ Blue (v1.2 - rollback ready) │
                               └──────────────────────────────┘
Implementing with AWS Elastic Load Balancing and ECS
The AWS implementation uses target groups to represent each environment:
# Create two target groups: one per environment
# (Both get the same health-check settings -- the environments must stay
#  structurally identical)
aws elbv2 create-target-group \
  --name myapp-blue \
  --protocol HTTP \
  --port 8080 \
  --vpc-id "$VPC_ID" \
  --health-check-path /actuator/health \
  --health-check-interval-seconds 10

aws elbv2 create-target-group \
  --name myapp-green \
  --protocol HTTP \
  --port 8080 \
  --vpc-id "$VPC_ID" \
  --health-check-path /actuator/health \
  --health-check-interval-seconds 10
# Deploy to the inactive environment
# (Identify which is currently active via a tag or a config parameter;
#  assume INACTIVE_ENV holds its name, e.g. "green")
INACTIVE_TG_ARN=$(get_inactive_target_group)  # Your logic here

# Update the ECS service to use the new task definition
aws ecs update-service \
  --cluster production \
  --service "myapp-${INACTIVE_ENV}" \
  --task-definition "myapp:${NEW_VERSION}"

# Wait for the service to stabilize
aws ecs wait services-stable \
  --cluster production \
  --services "myapp-${INACTIVE_ENV}"

# Run smoke tests against the inactive environment directly
# (Bypass the load balancer, hit the target group directly)
run_smoke_tests "$INACTIVE_TARGET_GROUP_URL"

# Switch traffic
aws elbv2 modify-listener \
  --listener-arn "$LISTENER_ARN" \
  --default-actions "Type=forward,TargetGroupArn=${INACTIVE_TG_ARN}"
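The script leaves `get_inactive_target_group` and `run_smoke_tests` as placeholders. One way to fill them in, sketched here under the assumption that `LISTENER_ARN`, `BLUE_TG_ARN`, and `GREEN_TG_ARN` are set (the function bodies and variable names are this sketch's, not AWS's):

```shell
#!/bin/sh
# Hypothetical helper: the listener's current default action tells us which
# target group is live; return the other one's ARN as the deploy target.
get_inactive_target_group() {
  live_tg=$(aws elbv2 describe-listeners \
    --listener-arns "$LISTENER_ARN" \
    --query 'Listeners[0].DefaultActions[0].TargetGroupArn' \
    --output text)
  if [ "$live_tg" = "$BLUE_TG_ARN" ]; then
    echo "$GREEN_TG_ARN"   # blue is live, so green is idle
  else
    echo "$BLUE_TG_ARN"    # green is live, so blue is idle
  fi
}

# Minimal smoke test: the health endpoint must answer 200 before we switch.
run_smoke_tests() {
  status=$(curl -s -o /dev/null -w '%{http_code}' "$1/actuator/health")
  [ "$status" = "200" ] || { echo "smoke test failed: HTTP $status"; return 1; }
}
```

Deriving the inactive side from the listener itself, rather than from a separately stored flag, has one advantage: the load balancer is the single source of truth, so the script can never disagree with where traffic is actually going.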
The Database Problem
Blue-green deployment is clean for stateless services. The database introduces a complication: both environments share it, which means any schema change must be backward-compatible with both the old and new application version simultaneously.
The pattern to follow is expand-contract (also called parallel change):
- Expand: add the new column or table as nullable, deploy the new version that writes to both the old and new structure, and backfill existing rows so the new column is fully populated
- Contract: once the old version is fully retired (after the rollback window closes), clean up the old structure
-- Release N (expand): add the new column as nullable (the old version ignores it)
ALTER TABLE payments ADD COLUMN idempotency_key VARCHAR(64) NULL;
-- Release N: the new version writes to both payment_reference (old) and
-- idempotency_key (new); existing rows are backfilled

-- Release N+1 (contract): remove the old column, tighten the new one
-- (ALTER COLUMN ... SET NOT NULL is PostgreSQL syntax; MySQL uses MODIFY)
ALTER TABLE payments DROP COLUMN payment_reference;
ALTER TABLE payments ALTER COLUMN idempotency_key SET NOT NULL;
This requires more migration steps per schema change, but it makes blue-green safe for services with database persistence — which is most services.
When Blue-Green Is Worth the Cost
The infrastructure cost is real: you're running two environments simultaneously during deployments, and the idle environment consumes resources. For services on Kubernetes or ECS, this typically means 1.3–1.5× the normal running cost during the deployment window.
Blue-green is worth the cost when: rollback speed is critical (financial services, auth systems, anything where a 45-minute rollback means 45 minutes of lost revenue or compromised security), when the change is high-risk (major version upgrades, significant behavior changes), or when the team has had multiple painful rolling-deployment rollbacks that would have been avoided with blue-green.
It's not necessary for every service. It's essential for the ones where it matters. Know which ones those are before the 2pm incident.