Blue-Green Deployment: The Strategy That Makes Rollbacks Painless

by Arif Ikhsanudin, Backend Developer

The Rollback That Took 45 Minutes

The deployment went out at 2pm. Error rates climbed at 2:07. The on-call engineer identified the cause at 2:19 — a regression in the new version. The rollback started at 2:22. It completed at 3:07, because rolling back a rolling deployment means spinning up old instances and waiting for traffic to shift back through the same gradual rollout process.

45 minutes. That's the cost of not having an idle environment ready to take traffic.

Blue-green deployment exists specifically to eliminate this: when the new version (green) is misbehaving, the old version (blue) is still fully operational and rollback is a load balancer switch — typically under 60 seconds.

How Blue-Green Works

The model is straightforward. You maintain two production environments — blue and green — that are structurally identical. At any given time, one is live (serving production traffic) and one is idle (the previous version, kept warm).

Deployment works like this:

  1. The idle environment (green, assuming blue is live) is updated with the new version
  2. The new version is validated against production data with production configuration, but without serving live traffic — smoke tests, health checks, sanity checks
  3. Traffic is switched from blue to green at the load balancer level — a single configuration change
  4. Blue remains running, serving no traffic, for a defined warm rollback window (typically 24–48 hours)
  5. If something goes wrong post-switch, rollback is switching the load balancer back to blue — under 60 seconds

Before the switch:

         ┌──────────┐       ┌──────────────────────────────┐
Traffic  │  ALB /   │──────▶│  Blue (v1.2 - currently live)│
         │  NLB     │       └──────────────────────────────┘
         └──────────┘
                            ┌──────────────────────────────┐
                            │ Green (v1.3 - being deployed)│
                            └──────────────────────────────┘

After validation and switch:

         ┌──────────┐       ┌──────────────────────────────┐
Traffic  │  ALB /   │──────▶│  Green (v1.3 - now live)     │
         │  NLB     │       └──────────────────────────────┘
         └──────────┘
                            ┌──────────────────────────────┐
                            │  Blue (v1.2 - rollback ready)│
                            └──────────────────────────────┘
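
Step 2 above, validating the idle environment before it takes traffic, can be sketched as a small shell gate. This is a minimal sketch, not a definitive implementation: the function names, the base URL, and the list of paths are assumptions chosen for illustration, and it only checks for HTTP 200 responses.

```shell
#!/usr/bin/env bash
# Hypothetical smoke-test gate. Function names and arguments are assumptions.

# Print the HTTP status code for a URL (curl prints 000 if the request fails).
http_status() {
  curl --silent --output /dev/null --max-time 5 \
    --write-out '%{http_code}' "$1" || true
}

# Return 0 only if every given path under the base URL answers with HTTP 200.
# Usage: run_smoke_tests BASE_URL PATH [PATH...]
run_smoke_tests() {
  local base_url="$1"
  shift
  local path status failed=0
  for path in "$@"; do
    status="$(http_status "${base_url}${path}")"
    if [ "$status" != "200" ]; then
      echo "FAIL ${path} -> ${status}" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

Called as `run_smoke_tests "$IDLE_URL" /actuator/health`, where `IDLE_URL` is a placeholder for whatever address reaches the idle environment, a non-zero exit status should stop the pipeline before the traffic switch.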

Implementing with AWS Elastic Load Balancer and ECS

The AWS implementation uses target groups to represent each environment:

# Abort on the first failure so a failed smoke test never reaches the traffic switch
set -euo pipefail

# Create two target groups, one per environment, with identical health checks
aws elbv2 create-target-group \
  --name myapp-blue \
  --protocol HTTP \
  --port 8080 \
  --vpc-id "$VPC_ID" \
  --health-check-path /actuator/health \
  --health-check-interval-seconds 10

aws elbv2 create-target-group \
  --name myapp-green \
  --protocol HTTP \
  --port 8080 \
  --vpc-id "$VPC_ID" \
  --health-check-path /actuator/health \
  --health-check-interval-seconds 10

# Identify the inactive environment (via a tag or a config parameter).
# get_inactive_env is a placeholder for your own logic; it prints "blue" or "green".
INACTIVE_ENV=$(get_inactive_env)

# Look up the ARN of the matching target group by name
INACTIVE_TG_ARN=$(aws elbv2 describe-target-groups \
  --names "myapp-${INACTIVE_ENV}" \
  --query 'TargetGroups[0].TargetGroupArn' \
  --output text)

# Update the inactive ECS service to use the new task definition
# (NEW_VERSION is the task definition revision)
aws ecs update-service \
  --cluster production \
  --service "myapp-${INACTIVE_ENV}" \
  --task-definition "myapp:${NEW_VERSION}"

# Wait for the service to stabilize
aws ecs wait services-stable \
  --cluster production \
  --services "myapp-${INACTIVE_ENV}"

# Run smoke tests against the inactive environment without live traffic.
# A target group has no address of its own, so INACTIVE_ENV_URL must point at
# something that routes to it, for example a test listener on a separate port.
run_smoke_tests "$INACTIVE_ENV_URL"

# Switch traffic
aws elbv2 modify-listener \
  --listener-arn "$LISTENER_ARN" \
  --default-actions "Type=forward,TargetGroupArn=${INACTIVE_TG_ARN}"
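
The "identify which is currently active" step and the rollback can be sketched with the same `elbv2` commands. This is one possible approach, not necessarily what a given team uses: it assumes the listener's default action is a plain forward to a single target group, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers. They assume the listener's default action forwards
# to exactly one target group.

# Print the ARN of the target group the listener currently forwards to.
live_target_group() {
  aws elbv2 describe-listeners \
    --listener-arns "$1" \
    --query 'Listeners[0].DefaultActions[0].TargetGroupArn' \
    --output text
}

# Given the live ARN plus the blue and green ARNs, print the idle one.
idle_target_group() {
  local live="$1" blue="$2" green="$3"
  if [ "$live" = "$blue" ]; then
    echo "$green"
  elif [ "$live" = "$green" ]; then
    echo "$blue"
  else
    echo "listener points at neither target group" >&2
    return 1
  fi
}

# Point the listener at a target group. The forward switch and the
# rollback are the same call with a different ARN.
switch_traffic() {
  local listener_arn="$1" target_group_arn="$2"
  aws elbv2 modify-listener \
    --listener-arn "$listener_arn" \
    --default-actions "Type=forward,TargetGroupArn=${target_group_arn}"
}
```

Rolling back is then `switch_traffic "$LISTENER_ARN" "$PREVIOUS_TG_ARN"`, where `PREVIOUS_TG_ARN` is whatever `live_target_group` returned before the switch. Recording that value at deploy time avoids having to work it out during an incident.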

The Database Problem

Blue-green deployment is clean for stateless services. The database introduces a complication: both environments share it, which means the schema must stay compatible with both the old and the new application versions for as long as either one can receive traffic.

The pattern to follow is expand-contract (also called parallel change):

  1. Expand: add the new column or table, make it nullable, and deploy the new version, which writes to both the old and the new structure
  2. Contract: once the old version is fully retired (after the rollback window closes), clean up the old structure

-- Release N: add new column as nullable (old version ignores it)
ALTER TABLE payments ADD COLUMN idempotency_key VARCHAR(64) NULL;

-- Release N: new version writes to both payment_reference (old) and idempotency_key (new)
-- Release N+1: new version no longer reads or writes payment_reference
-- Once no live or rollback-ready version uses payment_reference:
--   backfill idempotency_key for older rows, drop the old column, add the NOT NULL constraint

This requires more migration steps per schema change, but it makes blue-green safe for services with database persistence — which is most services.
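
Because the contract step is only safe after the rollback window closes, a deployment script can refuse to run it early. The following is a minimal sketch under stated assumptions: the function name is hypothetical, and it assumes the switch time was recorded as Unix epoch seconds when traffic moved.

```shell
#!/usr/bin/env bash
# Hypothetical guard for the contract migration.

# Return 0 if the rollback window has closed.
#   $1 = time of the traffic switch (Unix epoch seconds)
#   $2 = rollback window length in hours
#   $3 = current time in epoch seconds (optional, defaults to now)
rollback_window_closed() {
  local switched_at="$1" window_hours="$2" now="${3:-$(date +%s)}"
  [ $(( now - switched_at )) -ge $(( window_hours * 3600 )) ]
}
```

A pipeline could wrap the cleanup migration in `if rollback_window_closed "$SWITCHED_AT" 48; then ... fi`, with `SWITCHED_AT` being the recorded switch time and 48 matching the upper end of the 24–48 hour window mentioned earlier.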

When Blue-Green Is Worth the Cost

The infrastructure cost is real: you're running two environments simultaneously during deployments, and the idle environment keeps consuming resources until the rollback window closes. For services on Kubernetes or ECS, this typically means 1.3–1.5× the normal running cost during that period; a full-size idle copy pushes compute toward 2×.

Blue-green is worth the cost when rollback speed is critical (financial services, auth systems, anything where a 45-minute rollback means 45 minutes of lost revenue or compromised security), when the change is high-risk (major version upgrades, significant behavior changes), or when the team has had multiple painful rolling-deployment rollbacks that blue-green would have avoided.

It's not necessary for every service. It's essential for the ones where it matters. Know which ones those are before the 2pm incident.
