Blue-Green Deployment: The Strategy That Makes Rollbacks Painless

by Eric Hanson, Backend Developer at Clean Systems Consulting

The Rollback That Took 45 Minutes

The deployment went out at 2pm. Error rates climbed at 2:07. The on-call engineer identified the cause at 2:19 — a regression in the new version. The rollback started at 2:22. It completed at 3:07, because rolling back a rolling deployment means spinning up old instances and waiting for traffic to shift back through the same gradual rollout process.

45 minutes. That's the cost of not having an idle environment ready to take traffic.

Blue-green deployment exists specifically to eliminate this: when the new version (green) is misbehaving, the old version (blue) is still fully operational and rollback is a load balancer switch — typically under 60 seconds.

How Blue-Green Works

The model is straightforward. You maintain two production environments — blue and green — that are structurally identical. At any given time, one is live (serving production traffic) and one is idle (the previous version, kept warm).

Deployment works like this:

  1. The idle environment (green, assuming blue is live) is updated with the new version
  2. The new version is validated against production data with production configuration, but without serving live traffic — smoke tests, health checks, sanity checks
  3. Traffic is switched from blue to green at the load balancer level — a single configuration change
  4. Blue remains running, serving no traffic, for a defined warm rollback window (typically 24–48 hours)
  5. If something goes wrong post-switch, rollback is switching the load balancer back to blue — under 60 seconds

Before the switch:

         ┌──────────┐       ┌──────────────────────────────┐
Traffic  │  ALB /   │──────▶│  Blue (v1.2 - currently live)│
         │  NLB     │       └──────────────────────────────┘
         └──────────┘
                            ┌──────────────────────────────┐
                            │ Green (v1.3 - being deployed)│
                            └──────────────────────────────┘

After validation and switch:

         ┌──────────┐       ┌──────────────────────────────┐
Traffic  │  ALB /   │──────▶│  Green (v1.3 - now live)     │
         │  NLB     │       └──────────────────────────────┘
         └──────────┘
                            ┌──────────────────────────────┐
                            │  Blue (v1.2 - rollback ready)│
                            └──────────────────────────────┘

Implementing with AWS Elastic Load Balancing and ECS

The AWS implementation uses target groups to represent each environment:

# Create two target groups: one per environment
aws elbv2 create-target-group \
  --name myapp-blue \
  --protocol HTTP \
  --port 8080 \
  --vpc-id $VPC_ID \
  --health-check-path /actuator/health \
  --health-check-interval-seconds 10

aws elbv2 create-target-group \
  --name myapp-green \
  --protocol HTTP \
  --port 8080 \
  --vpc-id $VPC_ID \
  --health-check-path /actuator/health

# Deploy to the inactive environment
# (Identify which is currently active via a tag or a config parameter)
INACTIVE_ENV=$(get_inactive_environment)         # Your logic here: "blue" or "green"
INACTIVE_TG_ARN=$(get_target_group_arn "$INACTIVE_ENV")  # Your logic here: ARN of the idle target group

# Update the ECS service to use the new task definition
aws ecs update-service \
  --cluster production \
  --service myapp-${INACTIVE_ENV} \
  --task-definition myapp:${NEW_VERSION}

# Wait for the service to stabilize
aws ecs wait services-stable \
  --cluster production \
  --services myapp-${INACTIVE_ENV}

# Run smoke tests against the inactive environment directly
# (Bypass the live listener; reach the tasks behind the inactive target
# group, e.g. via a test listener or their direct addresses)
run_smoke_tests $INACTIVE_TARGET_GROUP_URL

# Switch traffic
aws elbv2 modify-listener \
  --listener-arn $LISTENER_ARN \
  --default-actions Type=forward,TargetGroupArn=${INACTIVE_TG_ARN}
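The `get_inactive_environment` placeholder above can be sketched in a few lines. This is one possible approach, not the only one: it assumes the `myapp-blue` / `myapp-green` naming convention and derives the live environment from the listener's current default action. It also shows the rollback from step 5, which is the same listener change pointed back at the previous target group.

```shell
#!/usr/bin/env bash
# Sketch only: assumes target group names contain "blue" or "green"
# and that $LISTENER_ARN is already set.

# Pure helper: the inactive environment is whichever one is not live.
other_env() {
  if [ "$1" = "blue" ]; then echo "green"; else echo "blue"; fi
}

# Ask the listener which target group it currently forwards to (the live one):
# LIVE_TG_ARN=$(aws elbv2 describe-listeners \
#   --listener-arns "$LISTENER_ARN" \
#   --query 'Listeners[0].DefaultActions[0].TargetGroupArn' \
#   --output text)
# case "$LIVE_TG_ARN" in
#   *blue*)  LIVE_ENV=blue ;;
#   *green*) LIVE_ENV=green ;;
# esac
# INACTIVE_ENV=$(other_env "$LIVE_ENV")

# Rollback (step 5) is the same switch in reverse: point the listener's
# default action back at the previously live target group.
# aws elbv2 modify-listener \
#   --listener-arn "$LISTENER_ARN" \
#   --default-actions Type=forward,TargetGroupArn=${PREVIOUS_TG_ARN}
```

Keeping the detection logic as a pure function makes it trivial to test without touching AWS; the commented calls show where the real lookups would go.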

The Database Problem

Blue-green deployment is clean for stateless services. The database introduces a complication: both environments share it, which means any schema change must be backward-compatible with both the old and new application version simultaneously.

The pattern to follow is expand-contract (also called parallel change):

  1. Expand: add the new column or table as nullable, then deploy the new version, which writes to both the old and the new structure (backfilling existing rows as needed)
  2. Contract: once the old version is fully retired (after the rollback window closes), drop the old structure and tighten constraints on the new one

-- Release N: add new column as nullable (old version ignores it)
ALTER TABLE payments ADD COLUMN idempotency_key VARCHAR(64) NULL;

-- Release N: new version writes to both payment_reference (old) and idempotency_key (new)
-- Release N+1: old column is removed, new column gets NOT NULL constraint

This requires more migration steps per schema change, but it makes blue-green safe for services with database persistence — which is most services.
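One operational detail worth automating: the contract migration must not run while blue can still take traffic. A minimal sketch of such a guard, assuming the deploy script records the switch time as a Unix timestamp; the `contract_payments.sql` file name and the 48-hour window are illustrative, not prescribed:

```shell
#!/usr/bin/env bash
# Hypothetical gate: refuse to run the contract migration until the
# rollback window has passed since the traffic switch.

ROLLBACK_WINDOW_HOURS=48

# Returns success only if at least $ROLLBACK_WINDOW_HOURS have elapsed
# between the recorded switch timestamp and "now" (both Unix seconds).
contract_allowed() {
  local switch_ts=$1 now_ts=$2
  local elapsed_hours=$(( (now_ts - switch_ts) / 3600 ))
  [ "$elapsed_hours" -ge "$ROLLBACK_WINDOW_HOURS" ]
}

# Usage sketch (SWITCH_TS would be recorded at switch time, e.g. in SSM):
# if contract_allowed "$SWITCH_TS" "$(date +%s)"; then
#   psql "$DATABASE_URL" -f migrations/contract_payments.sql
# else
#   echo "Rollback window still open; blue may still need the old schema." >&2
#   exit 1
# fi
```

Tying the contract step to the same window that keeps blue warm means the schema cleanup can never race ahead of the rollback option it would break.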

When Blue-Green Is Worth the Cost

The infrastructure cost is real: you're running two environments simultaneously during deployments, and the idle environment consumes resources. For services on Kubernetes or ECS, this typically means 1.3–1.5× the normal running cost during the deployment window.

Blue-green is worth the cost when:

  • Rollback speed is critical: financial services, auth systems, anything where a 45-minute rollback means 45 minutes of lost revenue or compromised security
  • The change is high-risk: major version upgrades, significant behavior changes
  • The team has had multiple painful rolling-deployment rollbacks that blue-green would have avoided

It's not necessary for every service. It's essential for the ones where it matters. Know which ones those are before the 2pm incident.

