What Happens to Your System When One Service Goes Down

by Eric Hanson, Backend Developer at Clean Systems Consulting

The failure scenario most teams haven't war-gamed

Inventory Service goes down at 2 PM on a Tuesday. Within ninety seconds, your entire checkout flow is unavailable. This is not hypothetical — it's a failure pattern that repeats across organizations that haven't designed explicitly for service unavailability.

Here's the propagation path: Order Service calls Inventory Service synchronously on every checkout. With Inventory Service down, those calls block until timeout (let's say 10 seconds, which is already too high). Order Service has a fixed-size Tomcat thread pool of 200 threads. With 20 requests per second, the 200 threads fill up in 10 seconds. Order Service is now effectively down. API Gateway, which calls Order Service, starts queuing requests. Those queue up until its own timeout is hit. Within two minutes, the outage has propagated from one service to all user-facing functionality.
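The exhaustion arithmetic is worth writing down explicitly. A minimal sketch of the numbers above, assuming a fixed pool and a constant arrival rate (class and method names are illustrative):

```java
public class PoolMath {
    /**
     * Seconds until a fixed-size thread pool saturates when every request
     * blocks on a dead downstream. Each arriving request takes a thread and
     * holds it for the full downstream timeout, so threads are consumed at
     * the arrival rate and none come back until the first timeout fires.
     */
    static double secondsToExhaust(int poolSize, double requestsPerSecond) {
        return poolSize / requestsPerSecond;
    }

    public static void main(String[] args) {
        // The scenario above: 200 Tomcat threads, 20 req/s, 10 s timeout.
        double t = secondsToExhaust(200, 20);
        System.out.println("Pool exhausted after " + t + " s");
        // Because the 10 s timeout is >= this exhaustion time, the pool
        // hits zero free threads before a single blocked call is released.
    }
}
```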

If you haven't explicitly modeled this failure path, you will discover it during an incident.

The three propagation mechanisms

Thread pool / connection pool exhaustion: Synchronous downstream calls hold a thread or connection for the full duration of the wait. When the downstream is slow or down, each call holds its resource for the entire timeout instead of milliseconds, so the pools saturate even though incoming load hasn't changed. This is the mechanism in the example above.

Memory queue overflow: If downstream consumers are down and your service buffers work in an in-memory queue, the queue grows without bound until you hit OOM. Services that accumulate "work to do" without backpressure protection fail this way.
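Backpressure here means refusing work instead of buffering it forever. A minimal sketch using a bounded `ArrayBlockingQueue` (capacity and names are illustrative):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedBuffer {
    // Bounded: once full, offer() fails fast instead of growing toward OOM.
    private final BlockingQueue<String> queue;

    BoundedBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /**
     * Returns false when the buffer is full -- the caller must shed load
     * (reject the request, write to durable storage, return 503, etc.)
     * rather than accumulate unbounded "work to do" in memory.
     */
    boolean submit(String work) {
        return queue.offer(work);
    }

    public static void main(String[] args) {
        BoundedBuffer buffer = new BoundedBuffer(2);
        System.out.println(buffer.submit("a")); // accepted
        System.out.println(buffer.submit("b")); // accepted
        System.out.println(buffer.submit("c")); // rejected: shed load
    }
}
```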

Cascading retries: When Service B is down, Service A retries. Multiple instances of Service A retry. When B recovers, it receives amplified traffic from all the queued retries simultaneously. If B can't handle the retry storm, it goes back down. This oscillation can continue for minutes after the root cause is resolved.
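The standard mitigation for the retry storm is capped exponential backoff with jitter, so recovering instances don't all retry in lockstep. A sketch of the "full jitter" variant (constants are illustrative):

```java
import java.util.concurrent.ThreadLocalRandom;

public class Backoff {
    /**
     * "Full jitter": pick a random delay in [0, min(cap, base * 2^attempt)].
     * Randomizing over the whole window spreads the retry wave out in time,
     * so a recovering service sees a trickle instead of a synchronized spike.
     */
    static long delayMillis(int attempt, long baseMillis, long capMillis) {
        long window = Math.min(capMillis, baseMillis * (1L << Math.min(attempt, 30)));
        return ThreadLocalRandom.current().nextLong(window + 1);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 5; attempt++) {
            System.out.printf("attempt %d: wait up to %d ms%n",
                attempt, delayMillis(attempt, 100, 10_000));
        }
    }
}
```

Backoff alone doesn't cap total retry volume; pairing it with a retry budget or a circuit breaker (below) is what actually stops the amplification.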

Designing for partial availability

The foundational principle: each service must have a documented answer to the question "what do I do when Service X is unavailable?" This answer should be coded and tested, not figured out during an incident.

Circuit breakers (Resilience4j, Polly, or the older Hystrix, now in maintenance mode) stop calls to known-unhealthy services before threads pile up:

@CircuitBreaker(
    name = "inventoryService",
    fallbackMethod = "inventoryFallback"
)
public InventoryStatus checkAvailability(String itemId) {
    return inventoryClient.getStatus(itemId);
}

private InventoryStatus inventoryFallback(String itemId, Exception ex) {
    // Explicitly defined degraded behavior
    log.warn("Inventory service unavailable, returning optimistic status for {}", itemId);
    return InventoryStatus.optimisticallyAvailable(itemId);
}

The circuit breaker opens after a configurable failure threshold, immediately returning the fallback result for subsequent calls. The downstream service stops receiving calls, which prevents retry amplification and gives it time to recover. After a configured wait, the circuit enters half-open state and allows a test call through. If the test call succeeds, the circuit closes and normal traffic resumes; if it fails, the circuit reopens and the wait starts over.
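The closed / open / half-open lifecycle fits in a few dozen lines. A simplified, single-threaded sketch of the state machine (thresholds and names are illustrative; a real library like Resilience4j adds sliding windows, thread safety, and metrics):

```java
import java.util.function.Supplier;

public class SimpleBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAtMillis = 0;
    private final int failureThreshold;
    private final long waitMillis;

    SimpleBreaker(int failureThreshold, long waitMillis) {
        this.failureThreshold = failureThreshold;
        this.waitMillis = waitMillis;
    }

    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAtMillis < waitMillis) {
                return fallback.get();      // fail fast: no downstream call made
            }
            state = State.HALF_OPEN;        // wait elapsed: allow one test call
        }
        try {
            T result = action.get();
            consecutiveFailures = 0;
            state = State.CLOSED;           // success (or test call passed): close
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;         // trip (or re-trip) the breaker
                openedAtMillis = System.currentTimeMillis();
            }
            return fallback.get();
        }
    }

    State state() { return state; }
}
```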

Bulkheads isolate resource pools so that a slow downstream service can only consume a bounded portion of your service's resources:

resilience4j:
  bulkhead:
    instances:
      inventoryService:
        maxConcurrentCalls: 10  # Inventory calls can use at most 10 threads
        maxWaitDuration: 100ms

With a bulkhead, Inventory Service slowness consumes at most 10 threads. The other 190 remain available for requests that don't depend on Inventory Service. Your service degrades gracefully rather than failing completely.
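Under the hood, a semaphore bulkhead is just a permit counter in front of the call. A self-contained sketch of the mechanism (sizes and names are illustrative):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class Bulkhead {
    private final Semaphore permits;
    private final long maxWaitMillis;

    Bulkhead(int maxConcurrentCalls, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrentCalls);
        this.maxWaitMillis = maxWaitMillis;
    }

    /**
     * Runs the call if a permit frees up within maxWaitMillis, otherwise
     * rejects immediately -- bounding how many threads can ever be stuck
     * waiting on this one dependency.
     */
    <T> T execute(Supplier<T> call, Supplier<T> rejected) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return rejected.get();
        }
        if (!acquired) {
            return rejected.get();
        }
        try {
            return call.get();
        } finally {
            permits.release();
        }
    }
}
```

A rejected call returns in about `maxWaitDuration` rather than holding a thread for the full downstream timeout, which is the whole point.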

Health checks that actually reflect dependency health

Kubernetes liveness and readiness probes are your operational interface for controlled failure. Most services implement a trivial liveness check (/health returns 200 if the process is alive), but a readiness check that does nothing more than that misses the point.

Readiness should reflect whether the service can usefully serve traffic:

@Component
public class ReadinessCheck implements HealthIndicator {
    @Override
    public Health health() {
        boolean dbHealthy = checkDatabaseConnection();
        boolean cacheHealthy = checkCacheConnection();
        
        // Downstream services: fail soft, don't block readiness
        // A service can be ready even if Inventory is slow
        
        if (!dbHealthy) {
            return Health.down()
                .withDetail("reason", "Database unreachable")
                .build();
        }
        // Cache trouble is reported but does not flip readiness
        return Health.up()
            .withDetail("cache", cacheHealthy ? "up" : "degraded")
            .build();
    }
}

A service with a failed database should fail its readiness check so Kubernetes stops routing traffic to it. A service with a slow downstream dependency should remain ready and let the circuit breaker handle those calls. These are different failure conditions with different correct responses.
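Wiring this distinction into Kubernetes means pointing the two probes at different endpoints. A sketch of the pod spec fragment (paths, port, and timings are illustrative; the actuator paths assume Spring Boot's health probe groups are enabled):

```yaml
livenessProbe:
  httpGet:
    path: /actuator/health/liveness    # process alive? never checks dependencies
    port: 8080
  periodSeconds: 10
  failureThreshold: 3                  # failing this restarts the pod
readinessProbe:
  httpGet:
    path: /actuator/health/readiness   # can we usefully serve? includes the DB check
    port: 8080
  periodSeconds: 5
  failureThreshold: 2                  # failing this only removes the pod from routing
```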

Testing failure scenarios before they happen

Chaos engineering — deliberately injecting failures to validate resilience mechanisms — is the only way to have confidence in your failure handling before it matters. Netflix's Chaos Monkey is the famous implementation, but the principle is simpler: periodically kill a service instance, inject artificial latency, or drop network connections, and observe whether the system behaves as designed.

Minimal chaos testing for a microservices system: once per quarter, take down a non-critical service in staging and verify that dependent services degrade as expected (return fallbacks, log clearly, metrics reflect the degradation). This is cheaper than discovering your circuit breakers were misconfigured during a production incident.
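The "verify that dependents degrade as expected" step can start life as a plain automated test: point the code at a dependency that always fails and assert the fallback path is taken. A self-contained sketch (names and behavior are illustrative, not the article's checkout code):

```java
import java.util.function.Supplier;

public class ChaosCheck {
    /** Degraded checkout: try the inventory call, fall back on failure. */
    static String checkout(Supplier<String> inventoryCall) {
        try {
            return "confirmed:" + inventoryCall.get();
        } catch (RuntimeException e) {
            return "confirmed:optimistic";   // explicitly defined degraded behavior
        }
    }

    public static void main(String[] args) {
        // Chaos injection: the dependency always throws.
        Supplier<String> deadInventory =
            () -> { throw new RuntimeException("injected outage"); };
        String result = checkout(deadInventory);
        // The assertion we want green BEFORE a real incident:
        if (!result.equals("confirmed:optimistic")) {
            throw new AssertionError("degradation broken: " + result);
        }
        System.out.println("degradation verified: " + result);
    }
}
```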

The goal is not a system that never fails — it's a system where any single service failure produces a known, bounded, recoverable effect rather than an unpredictable cascade.
