What Happens to Your System When One Service Goes Down

by Arif Ikhsanudin, Backend Developer

The failure scenario most teams haven't war-gamed

Inventory Service goes down at 2 PM on a Tuesday. Within ninety seconds, your entire checkout flow is unavailable. This is not hypothetical — it's a failure pattern that repeats across organizations that haven't designed explicitly for service unavailability.

Here's the propagation path: Order Service calls Inventory Service synchronously on every checkout. With Inventory Service down, those calls block until they time out (say 10 seconds, which is already too high). Order Service runs on a fixed-size Tomcat thread pool of 200 threads. At 20 requests per second, each request holds its thread for the full 10 seconds, so all 200 threads are occupied after 10 seconds. Order Service is now effectively down. API Gateway, which calls Order Service, starts queuing requests until its own timeout is hit. Within two minutes, the outage has propagated from one service to all user-facing functionality.
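The arithmetic in the scenario above can be checked with a few lines of Java. This is a back-of-envelope sketch, not a load model: the class and method names are made up for illustration, and it assumes a constant arrival rate and that every request blocks for the full timeout.

```java
// Back-of-envelope model of the scenario above; names are illustrative only.
final class PoolSaturation {
    /**
     * Seconds until every worker thread is blocked on the failed dependency,
     * or positive infinity if the pool never fills at this rate and timeout.
     */
    static double secondsUntilSaturation(int poolSize, double requestsPerSecond, double timeoutSeconds) {
        // Little's law: steady-state in-flight requests = arrival rate * time each one is held.
        double steadyStateInFlight = requestsPerSecond * timeoutSeconds;
        if (steadyStateInFlight < poolSize) {
            // Threads are released by the timeout before the pool can fill up.
            return Double.POSITIVE_INFINITY;
        }
        return poolSize / requestsPerSecond;
    }

    public static void main(String[] args) {
        System.out.println(secondsUntilSaturation(200, 20, 10)); // 10.0
        System.out.println(secondsUntilSaturation(200, 20, 2));  // Infinity
    }
}
```

Under these assumptions, dropping the timeout from 10 seconds to 2 keeps only about 40 threads busy, which is why a short timeout is the first line of defense.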

If you haven't explicitly modeled this failure path, you will discover it during an incident.

The three propagation mechanisms

Thread pool / connection pool exhaustion: Synchronous downstream calls hold a thread or connection for as long as they wait. When the downstream service is slow or failing, each request holds its thread for the full timeout, so the pool can saturate even though incoming load has not changed. This is the mechanism in the example above.

Memory queue overflow: If downstream consumers are down and your service buffers work in an in-memory queue, the queue grows without bound until you hit OOM. Services that accumulate "work to do" without backpressure protection fail this way.
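One way to add the missing backpressure is to bound the buffer and reject work when it is full. The sketch below uses the standard library's ArrayBlockingQueue; the class name and the String work items are placeholders, and what the caller does on rejection (shed load, return an error, spill to disk) is left open.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative bounded buffer: memory use is capped by the queue capacity.
final class BoundedWorkBuffer {
    private final BlockingQueue<String> queue;

    BoundedWorkBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Returns false instead of growing when the buffer is full, so the caller can push back. */
    boolean submit(String workItem) {
        return queue.offer(workItem); // offer() never blocks and never exceeds capacity
    }

    int pending() {
        return queue.size();
    }
}
```

A rejected submit is a signal to the caller, which is the point: the failure becomes visible at the edge instead of surfacing later as an out-of-memory crash.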

Cascading retries: When Service B is down, Service A retries. Multiple instances of Service A retry. When B recovers, it receives amplified traffic from all the queued retries simultaneously. If B can't handle the retry storm, it goes back down. This oscillation can continue for minutes after the root cause is resolved.
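A common way to soften a retry storm is exponential backoff with random jitter, so that instances of Service A do not all retry at the same moment. The following is a minimal sketch of the delay calculation only; the class name, the 0-based attempt numbering, and the base and cap values are assumptions for illustration. A real client would also cap the number of attempts.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative delay calculation for retries; not tied to any retry library.
final class RetryBackoff {
    /** Upper bound for the delay before retry number `attempt` (0-based). */
    static long maxDelayMillis(int attempt, long baseMillis, long capMillis) {
        // Clamp the shift so base * 2^attempt cannot overflow for sensible base values.
        int shift = Math.max(0, Math.min(attempt, 20));
        return Math.min(capMillis, baseMillis << shift);
    }

    /** "Full jitter": a uniformly random delay between 0 and the exponential bound. */
    static long delayMillis(int attempt, long baseMillis, long capMillis) {
        long bound = maxDelayMillis(attempt, baseMillis, capMillis);
        return ThreadLocalRandom.current().nextLong(bound + 1);
    }
}
```

With a base of 100 ms and a cap of 5 seconds, the bound grows 100, 200, 400, 800 ms and then flattens at the cap, while the jitter spreads individual retries across that window.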

Designing for partial availability

The foundational principle: each service must have a documented answer to the question "what do I do when Service X is unavailable?" This answer should be coded and tested, not figured out during an incident.

Circuit breakers (Resilience4j for Java, Polly for .NET, or the older Hystrix, which is now in maintenance mode) stop calls to known-unhealthy services before threads pile up:

@CircuitBreaker(
    name = "inventoryService",
    fallbackMethod = "inventoryFallback"
)
public InventoryStatus checkAvailability(String itemId) {
    return inventoryClient.getStatus(itemId);
}

private InventoryStatus inventoryFallback(String itemId, Exception ex) {
    // Fallback takes the same parameters as the protected method, plus the exception.
    // Explicitly defined degraded behavior
    log.warn("Inventory service unavailable, returning optimistic status for {}", itemId);
    return InventoryStatus.optimisticallyAvailable(itemId);
}

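The circuit breaker opens once failures cross a configurable threshold (in Resilience4j, a failure rate over a sliding window of recent calls). While open, it returns the fallback result immediately without calling the downstream service. The downstream service stops receiving calls, which prevents retry amplification and gives it time to recover. After a configured wait, the circuit enters the half-open state and lets a small number of trial calls through: if they succeed, the circuit closes again; if they fail, it reopens.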
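The state transitions described above can be sketched in plain Java. This is a teaching sketch, not how Resilience4j is implemented: it counts consecutive failures instead of a sliding-window failure rate, allows a single trial call in the half-open state, and takes an injected clock so it can be tested. All names are made up for illustration.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Simplified circuit breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED or OPEN.
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock; // injected time source in milliseconds
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    synchronized State state() {
        return state;
    }

    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!allowCall()) {
            return fallback.get(); // open: skip the downstream call entirely
        }
        try {
            T result = action.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized boolean allowCall() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN; // wait elapsed: let exactly one trial call through
            return true;
        }
        return state == State.CLOSED;
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }
}
```

In production a maintained library is the better choice; the value of the sketch is seeing that an open breaker never invokes the action at all, which is what frees the calling threads.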

Bulkheads isolate resource pools so that a slow downstream service can only consume a bounded portion of your service's resources:

resilience4j:
  bulkhead:
    instances:
      inventoryService:
        maxConcurrentCalls: 10  # Inventory calls can use at most 10 threads
        maxWaitDuration: 100ms

With a bulkhead, Inventory Service slowness consumes at most 10 threads. The other 190 remain available for requests that don't depend on Inventory Service. Your service degrades gracefully rather than failing completely.
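Conceptually, a semaphore-style bulkhead is a counter of permits around the downstream call. The sketch below illustrates the idea with java.util.concurrent.Semaphore; it is not the Resilience4j implementation, the names are hypothetical, and unlike the configuration above it rejects immediately instead of waiting up to maxWaitDuration.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Illustrative bulkhead: at most maxConcurrentCalls callers are inside action.get() at once.
final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // bulkhead full: degrade instead of tying up another thread
        }
        try {
            return action.get();
        } finally {
            permits.release(); // always return the permit, even if the action throws
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Rejecting without waiting is the strictest variant; a short wait, as in the YAML above, trades a little latency for fewer rejections during brief spikes.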

Health checks that actually reflect dependency health

Kubernetes liveness and readiness probes are your operational interface for controlled failure. Most services implement a trivial liveness check (/health returns 200 if the process is alive), and that is fine for liveness. A readiness check that only confirms the process is running misses the point.

Readiness should reflect whether the service can usefully serve traffic:

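import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class ReadinessCheck implements HealthIndicator {
    @Override
    public Health health() {
        // checkDatabaseConnection() and checkCacheConnection() are this
        // service's own helper methods (not shown here)
        boolean dbHealthy = checkDatabaseConnection();
        boolean cacheHealthy = checkCacheConnection();

        // Downstream services: fail soft, don't block readiness
        // A service can be ready even if Inventory is slow

        if (!dbHealthy) {
            return Health.down()
                .withDetail("reason", "Database unreachable")
                .build();
        }
        // Cache state is reported as a detail but does not gate readiness
        return Health.up()
            .withDetail("cache", cacheHealthy ? "reachable" : "unreachable")
            .build();
    }
}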

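A service with a failed database should fail its readiness check so Kubernetes stops routing traffic to it. A service with a slow downstream dependency should remain ready and let the circuit breaker handle those calls. These are different failure conditions with different correct responses. Note that in Spring Boot a custom indicator like this one only affects the Kubernetes readiness probe if it is included in the readiness health group; by default it contributes only to the overall health status.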

Testing failure scenarios before they happen

Chaos engineering, meaning deliberately injecting failures to validate resilience mechanisms, is the most reliable way to gain confidence in your failure handling before it matters. Netflix's Chaos Monkey is the famous implementation, but the principle is simpler: periodically kill a service instance, inject artificial latency, or drop network connections, and observe whether the system behaves as designed.

Minimal chaos testing for a microservices system: once per quarter, take down a non-critical service in staging and verify that dependent services degrade as expected (return fallbacks, log clearly, metrics reflect the degradation). This is cheaper than discovering your circuit breakers were misconfigured during a production incident.
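The same idea can be exercised at the code level before touching staging. The sketch below wraps any dependency call in a decorator that fails at a configurable rate, so a test can confirm that the fallback path actually runs. The class name and the exception type are assumptions for illustration; dedicated chaos tooling does far more than this.

```java
import java.util.Random;
import java.util.function.Supplier;

// Illustrative fault injection for tests: fails a configurable fraction of calls.
final class FaultInjector {
    static <T> Supplier<T> wrap(Supplier<T> delegate, double failureRate, Random random) {
        return () -> {
            // nextDouble() is in [0.0, 1.0), so a rate of 1.0 always fails and 0.0 never does.
            if (random.nextDouble() < failureRate) {
                throw new IllegalStateException("injected fault");
            }
            return delegate.get();
        };
    }
}
```

Passing a seeded Random keeps a test run reproducible, and wrapping the inventory client this way at a failure rate of 1.0 is a cheap check that the degraded behavior is what you documented.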

The goal is not a system that never fails — it's a system where any single service failure produces a known, bounded, recoverable effect rather than an unpredictable cascade.
