What Happens to Your System When One Service Goes Down
by Eric Hanson, Backend Developer at Clean Systems Consulting
The failure scenario most teams haven't war-gamed
Inventory Service goes down at 2 PM on a Tuesday. Within ninety seconds, your entire checkout flow is unavailable. This is not hypothetical — it's a failure pattern that repeats across organizations that haven't designed explicitly for service unavailability.
Here's the propagation path: Order Service calls Inventory Service synchronously on every checkout. With Inventory Service down, those calls block until timeout (let's say 10 seconds, which is already too high). Order Service has a fixed-size Tomcat thread pool of 200 threads. At 20 requests per second, with each request holding a thread for the full 10-second timeout, that is 200 concurrent blocked requests: the pool is exhausted within 10 seconds. Order Service is now effectively down. API Gateway, which calls Order Service, starts queuing requests. Those queue up until its own timeout is hit. Within two minutes, the outage has propagated from one service to all user-facing functionality.
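That 10-second timeout is a choice, not a fact of nature, and it directly sets how fast the pool drains. A minimal sketch of bounding the wait explicitly with java.net.http.HttpClient; the client class, URL, and timeout values here are illustrative assumptions, not a prescription for the stack described above:

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class InventoryHttpClient {

    // Give up on establishing a connection after 500 ms instead of hanging.
    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofMillis(500))
            .build();

    public String getStatus(String itemId) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://inventory-service/items/" + itemId))
                // Per-request deadline: a blocked thread is released after
                // 2 seconds instead of 10, so the pool drains five times faster.
                .timeout(Duration.ofSeconds(2))
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}

The specific numbers matter less than the fact that they are deliberate and shorter than the time it takes to exhaust the pool.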
If you haven't explicitly modeled this failure path, you will discover it during an incident.
The three propagation mechanisms
Thread pool / connection pool exhaustion: Synchronous downstream calls hold a thread or connection for as long as they wait. When a downstream service is slow or down, every call waits out the full timeout, so the pools saturate even though incoming load has not increased. This is the mechanism in the example above.
Memory queue overflow: If downstream consumers are down and your service buffers work in an in-memory queue, the queue grows without bound until the process runs out of memory. Services that accumulate "work to do" without backpressure protection fail this way; a bounded-queue sketch follows below.
Cascading retries: When Service B is down, Service A retries. Multiple instances of Service A retry. When B recovers, it receives amplified traffic from all the queued retries simultaneously. If B can't handle the retry storm, it goes back down. This oscillation can continue for minutes after the root cause is resolved.
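The standard mitigations for the retry storm are a hard cap on attempts and exponential backoff with jitter, so that waiting clients do not all hit the recovering service at the same instant. A minimal sketch using Resilience4j's retry module; the instance name and values are illustrative assumptions:

import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

public class InventoryRetryPolicy {

    static final RetryConfig CONFIG = RetryConfig.custom()
            // At most 3 attempts in total, so a hard outage does not turn
            // into an unbounded backlog of pending retries.
            .maxAttempts(3)
            // Exponential backoff starting at 500 ms, doubling each attempt,
            // with 50% randomized jitter to spread the retries out.
            .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(500, 2.0, 0.5))
            .build();

    static final Retry INVENTORY_RETRY = Retry.of("inventoryService", CONFIG);
}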
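For the queue-overflow mechanism above, the protection is a hard capacity plus an explicit rejection path. A minimal sketch with a bounded in-memory queue; the class name, capacity, and payload type are illustrative assumptions:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class OutboundBuffer {

    // Hard bound: beyond this the service sheds load instead of growing
    // its heap until the process is killed for running out of memory.
    private final BlockingQueue<String> pending = new ArrayBlockingQueue<>(10_000);

    public boolean enqueue(String message) {
        // offer() returns false when the queue is full. The caller must
        // handle the rejection (return 503, drop, or persist elsewhere)
        // rather than buffering work without limit.
        return pending.offer(message);
    }
}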
Designing for partial availability
The foundational principle: each service must have a documented answer to the question "what do I do when Service X is unavailable?" This answer should be coded and tested, not figured out during an incident.
Circuit breakers (Resilience4j, Hystrix, Polly) stop calls to known-unhealthy services before threads pile up:
@CircuitBreaker(
    name = "inventoryService",
    fallbackMethod = "inventoryFallback"
)
public InventoryStatus checkAvailability(String itemId) {
    return inventoryClient.getStatus(itemId);
}

private InventoryStatus inventoryFallback(String itemId, Exception ex) {
    // Explicitly defined degraded behavior
    log.warn("Inventory service unavailable, returning optimistic status for {}", itemId);
    return InventoryStatus.optimisticallyAvailable(itemId);
}
The circuit breaker opens after a configurable failure threshold, immediately returning the fallback result for subsequent calls. The downstream service stops receiving calls, which prevents retry amplification and gives it time to recover. After a configured wait, the circuit enters half-open state and allows a test call through.
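Those thresholds are configuration rather than code. A sketch of the corresponding Resilience4j settings, in the same style as the bulkhead config below; the values are illustrative assumptions to tune against your own traffic:

resilience4j:
  circuitbreaker:
    instances:
      inventoryService:
        slidingWindowSize: 20                     # judge health over the last 20 calls
        minimumNumberOfCalls: 10                  # don't open on a tiny sample
        failureRateThreshold: 50                  # open once 50% of calls fail
        waitDurationInOpenState: 30s              # stay open before probing again
        permittedNumberOfCallsInHalfOpenState: 3  # trial calls in half-open state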
Bulkheads isolate resource pools so that a slow downstream service can only consume a bounded portion of your service's resources:
resilience4j:
  bulkhead:
    instances:
      inventoryService:
        maxConcurrentCalls: 10    # Inventory calls can use at most 10 threads
        maxWaitDuration: 100ms
With a bulkhead, Inventory Service slowness consumes at most 10 threads. The other 190 remain available for requests that don't depend on Inventory Service. Your service degrades gracefully rather than failing completely.
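With the Spring annotation support, the bulkhead attaches at the call site the same way the circuit breaker does. A sketch, assuming the same inventoryService instance and the fallback method shown earlier:

@Bulkhead(name = "inventoryService", fallbackMethod = "inventoryFallback")
public InventoryStatus checkAvailability(String itemId) {
    // At most 10 concurrent callers get through; the rest wait up to
    // 100 ms and then fall back instead of piling onto the thread pool.
    return inventoryClient.getStatus(itemId);
}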
Health checks that actually reflect dependency health
Kubernetes liveness and readiness probes are your operational interface for controlled failure. Most services implement a trivial liveness check (/health returns 200 if the process is alive), which is fine for liveness, but a readiness check that only verifies the process is running misses the point.
Readiness should reflect whether the service can usefully serve traffic:
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class ReadinessCheck implements HealthIndicator {

    @Override
    public Health health() {
        // checkDatabaseConnection() and checkCacheConnection() are this
        // service's own probes (not shown).
        boolean dbHealthy = checkDatabaseConnection();
        boolean cacheHealthy = checkCacheConnection();

        // Hard dependency: without the database this instance cannot
        // usefully serve traffic, so readiness must fail.
        if (!dbHealthy) {
            return Health.down()
                    .withDetail("reason", "Database unreachable")
                    .build();
        }

        // Downstream services and soft dependencies: fail soft, don't block
        // readiness. The service can stay ready even if Inventory is slow;
        // the circuit breaker handles those calls. Report the state so it
        // is visible without taking the instance out of rotation.
        return Health.up()
                .withDetail("cacheHealthy", cacheHealthy)
                .build();
    }
}
A service with a failed database should fail its readiness check so Kubernetes stops routing traffic to it. A service with a slow downstream dependency should remain ready and let the circuit breaker handle those calls. These are different failure conditions with different correct responses.
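On the Kubernetes side this is just probe wiring. A sketch assuming Spring Boot Actuator's liveness and readiness endpoints and a container listening on port 8080; paths and timings are illustrative assumptions:

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 5
  failureThreshold: 3   # roughly 15 s of failed checks before traffic is withdrawn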
Testing failure scenarios before they happen
Chaos engineering — deliberately injecting failures to validate resilience mechanisms — is the only way to have confidence in your failure handling before it matters. Netflix's Chaos Monkey is the famous implementation, but the principle is simpler: periodically kill a service instance, inject artificial latency, or drop network connections, and observe whether the system behaves as designed.
Minimal chaos testing for a microservices system: once per quarter, take down a non-critical service in staging and verify that dependent services degrade as expected (return fallbacks, log clearly, metrics reflect the degradation). This is cheaper than discovering your circuit breakers were misconfigured during a production incident.
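Even before a staging drill, the fail-fast behavior itself can be exercised in a plain unit test by forcing the breaker open. A sketch against the Resilience4j API; the instance name is an assumption, and a real test would call your own service through the breaker:

import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertThrows;

class CircuitBreakerBehaviorTest {

    @Test
    void openBreakerFailsFastInsteadOfBlockingOnTimeout() {
        CircuitBreaker breaker = CircuitBreakerRegistry.ofDefaults()
                .circuitBreaker("inventoryService");

        // Simulate "Inventory Service is down" by forcing the breaker open.
        breaker.transitionToOpenState();

        // A guarded call is rejected immediately with CallNotPermittedException;
        // no thread sits blocked waiting for a 10-second timeout.
        assertThrows(CallNotPermittedException.class,
                () -> breaker.executeSupplier(() -> "should not run"));
    }
}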
The goal is not a system that never fails — it's a system where any single service failure produces a known, bounded, recoverable effect rather than an unpredictable cascade.