Rolling Deployments: Safe by Default If You Do Them Right

by Arif Ikhsanudin, Backend Developer

The Default That's Not Configured

Your Kubernetes deployment uses RollingUpdate strategy. You didn't set it explicitly — it's the default. Kubernetes replaces old pods with new pods, a few at a time, until the rollout is complete. This looks correct from the outside: the deployment progresses, health checks pass, the new version is live.

What you didn't configure: how quickly old pods are replaced, what "healthy" means before traffic is sent to a new pod, how long the old pod continues receiving requests after the new one starts, and whether the new version is actually compatible with requests already in flight. The defaults for most of these are permissive in ways that can cause real problems.

What the Defaults Actually Do

In Kubernetes with default RollingUpdate settings:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%        # Allow up to 25% extra pods during rollout
    maxUnavailable: 25%  # Allow up to 25% of pods to be unavailable

With 8 replicas, 25% works out to 2 for both settings, so during the rollout Kubernetes may run with as few as 6 available pods and as many as 10 pods in total. If the new pods start quickly and pass the health check, this moves fast. If the readiness probe is missing, or only checks TCP connectivity rather than actual readiness, pods receive traffic before they're truly ready.
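
To make the percentages concrete: Kubernetes converts them to absolute pod counts, rounding maxSurge up and maxUnavailable down. The fragment below is a sketch of what the defaults resolve to for the 8-replica example. The 10-replica figures in the comments are an added illustration, not part of the original setup.

```yaml
# What the 25% defaults resolve to with replicas: 8 (illustrative equivalent)
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 2        # ceil(8 * 0.25) = 2  -> at most 10 pods in total
    maxUnavailable: 2  # floor(8 * 0.25) = 2 -> at least 6 pods available

# With replicas: 10 the two settings diverge:
#   maxSurge       = ceil(2.5)  = 3
#   maxUnavailable = floor(2.5) = 2
```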

The missing configuration that makes rolling deployments actually safe:

# Deployment spec
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0    # Never go below desired capacity during rollout

  template:
    spec:
      containers:
        - name: myapp

          # Readiness probe: traffic only routes here when this passes
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
            successThreshold: 1

          # Liveness probe: restart the container if this fails
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3

          # Graceful shutdown: delay SIGTERM so endpoint removal can propagate, then drain
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 5"]  # Drain window

      terminationGracePeriodSeconds: 60

maxUnavailable: 0 ensures capacity never drops below the desired count — a new pod must be ready before an old one is removed. Combined with a real readiness probe (not just TCP), this prevents traffic from reaching pods before they're actually handling requests correctly.
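
The pace of the rollout, one of the unconfigured items listed earlier, can be bounded as well. A minimal sketch, assuming the same Deployment; both values are placeholders to tune, not recommendations taken from the configuration above.

```yaml
# Deployment spec (fragment)
spec:
  minReadySeconds: 10           # a new pod must stay ready this long before it counts as available
  progressDeadlineSeconds: 300  # flag the rollout as stalled after 5 minutes without progress (default: 600)
```

When the progress deadline passes, Kubernetes only reports the stalled rollout in the Deployment status. It does not roll back on its own.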

The Readiness Probe Is the Most Important Knob

The readiness probe decides whether a pod receives traffic. If it's wrong, the rest of the configuration doesn't matter.

A common mistake: using the liveness probe path for readiness. Liveness checks whether the process is alive. Readiness checks whether it's ready to serve traffic — which means all connection pools are initialized, caches are warm (if required), and downstream dependencies are reachable.

For Spring Boot, use the built-in readiness and liveness actuator endpoints that integrate with Spring's ApplicationContext lifecycle:

# application.yml
management:
  endpoint:
    health:
      probes:
        enabled: true
  health:
    readinessState:
      enabled: true
    livenessState:
      enabled: true

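Spring Boot reports the readiness endpoint as OUT_OF_SERVICE while the application's readiness state is REFUSING_TRAFFIC. It publishes an AvailabilityChangeEvent that moves the state to ACCEPTING_TRAFFIC only after the application context has refreshed and every ApplicationRunner and CommandLineRunner has finished. This means Kubernetes won't route traffic to a pod that's still running @PostConstruct initialization or startup runners. Connection pools are covered only if they are warmed during that startup work.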
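
By default the readiness group contains only the readinessState indicator, so it says nothing about downstream dependencies. If readiness should also reflect a dependency, as in the definition earlier in this section, other health indicators can be added to the group. A hedged sketch, assuming the application has a DataSource so that the `db` indicator exists:

```yaml
# application.yml
management:
  endpoint:
    health:
      group:
        readiness:
          include: readinessState,db   # "db" is an assumption: present only when a DataSource is configured
```

Be selective about what goes in this group. A dependency shared by every pod will mark all of them unready at once when it fails.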

Handling In-Flight Requests During Rollout

When Kubernetes terminates an old pod, requests that are already in progress on that pod need time to complete. Without a graceful shutdown window, those requests get abruptly terminated.

The preStop sleep combined with terminationGracePeriodSeconds creates that window. The sequence on pod termination:

  1. Pod is marked Terminating and the terminationGracePeriodSeconds countdown starts
  2. In parallel, the pod is removed from the Service endpoints (new requests stop arriving once proxies and load balancers pick up the change) and the preStop hook runs (sleep 5 seconds, to give that removal time to propagate)
  3. SIGTERM is sent to the container once the preStop hook completes
  4. Application handles SIGTERM by no longer accepting new requests and finishing in-flight ones
  5. If the process is still running when the grace period expires, SIGKILL is sent

For Spring Boot, configure graceful shutdown:

# application.yml
server:
  shutdown: graceful
spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s

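This tells Spring to stop accepting new requests and wait up to 30 seconds for active ones to finish. The preStop sleep plus this timeout (5 + 30 seconds here) must fit within terminationGracePeriodSeconds (60 seconds in the Deployment spec above), or Kubernetes sends SIGKILL before the drain completes.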

The Compatibility Requirement

Rolling deployments mean both versions run simultaneously during the rollout window. Any API contract, database schema, or message format must be compatible with both versions during that window.

A v1.3 pod writing records to the database that v1.2 pods can't read will cause errors for requests that land on v1.2 pods during the overlap period. The expand-contract migration pattern prevents this: add the column as nullable in one release, and make it required only in a later release, once no running pod depends on the old shape.
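
One way to express the two phases, sketched here as a Liquibase YAML changelog. Liquibase is an assumption (the article does not name a migration tool), and the table, column, and changeSet names are hypothetical.

```yaml
# db/changelog/db.changelog-master.yaml (hypothetical names throughout)
databaseChangeLog:
  # Release N (expand): the column is optional, so old pods that never write it keep working.
  - changeSet:
      id: add-delivery-notes
      author: example
      changes:
        - addColumn:
            tableName: orders
            columns:
              - column:
                  name: delivery_notes
                  type: varchar(255)

  # Release N+1 (contract): add this changeSet only after every pod writes the column
  # and existing rows have been backfilled.
  - changeSet:
      id: require-delivery-notes
      author: example
      changes:
        - addNotNullConstraint:
            tableName: orders
            columnName: delivery_notes
            columnDataType: varchar(255)
```

The second changeSet ships in a later release than the first. Applying both in the same rollout would recreate the incompatibility the pattern is meant to avoid.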

The key question before every rolling deployment: is my new version compatible with the currently deployed version, both ways? If not, consider blue-green instead — it eliminates the mixed-version window entirely. Rolling deployment safety depends on that compatibility invariant. Don't violate it silently.
