Circuit Breakers in Microservices: Stop Letting One Failure Break Everything
by Eric Hanson, Backend Developer at Clean Systems Consulting
The failure pattern a circuit breaker prevents
Inventory Service is responding slowly — GC pause, database lock contention, doesn't matter. Order Service calls it synchronously on every checkout. Without a circuit breaker: threads in Order Service block for the full timeout duration (say, 10 seconds), the thread pool fills up, Order Service stops accepting new requests. What started as an Inventory Service performance issue is now a complete Order Service outage. Users can't check out. Meanwhile, Inventory Service is getting hammered with retry traffic from every Order Service instance, making its own recovery harder.
The circuit breaker pattern, first named by Michael Nygard in "Release It!", addresses this by tracking failure rates and stopping calls to an unhealthy service before resources are exhausted. It works much like an electrical circuit breaker: when the failure rate crosses a threshold, the circuit opens and subsequent calls fail immediately (returning a fallback) instead of waiting for a timeout. After a recovery period, it allows a test call through. If that succeeds, the circuit closes.
The three states
Understanding the state machine is prerequisite to configuring it correctly:
Closed (normal): calls pass through. The circuit counts successes and failures. When the failure rate in a sliding window exceeds the threshold, the circuit opens.
Open (failing): calls are short-circuited immediately — they fail fast without touching the downstream service. The fallback method executes instead. The downstream service gets no traffic, which gives it time to recover.
Half-Open (testing): after the wait duration, a limited number of test calls are allowed through. If they succeed, the circuit closes. If they fail, it opens again for another wait duration.
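To make the transitions concrete, here is a deliberately simplified sketch of the state machine in plain Java. It is illustrative only, not Resilience4j's implementation, and the class and field names are made up for this example:

import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize = 20;                        // calls tracked while CLOSED
    private final double failureRateThreshold = 0.5;          // open at >= 50% failures
    private final Duration waitInOpen = Duration.ofSeconds(30);
    private final int halfOpenPermittedCalls = 3;

    private State state = State.CLOSED;
    private int calls = 0, failures = 0;                      // counters for the current window
    private int halfOpenCalls = 0, halfOpenFailures = 0;
    private Instant openedAt;

    synchronized <T> T call(Supplier<T> protectedCall, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (Instant.now().isBefore(openedAt.plus(waitInOpen))) {
                return fallback.get();                        // short-circuit: no downstream call
            }
            state = State.HALF_OPEN;                          // wait elapsed: allow test calls
            halfOpenCalls = halfOpenFailures = 0;
        }
        try {
            T result = protectedCall.get();
            record(false);
            return result;
        } catch (RuntimeException e) {
            record(true);
            return fallback.get();
        }
    }

    private void record(boolean failed) {
        if (state == State.HALF_OPEN) {
            halfOpenCalls++;
            if (failed) halfOpenFailures++;
            if (halfOpenCalls >= halfOpenPermittedCalls) {    // test calls done: decide
                boolean healthy = halfOpenFailures == 0;
                state = healthy ? State.CLOSED : State.OPEN;
                if (!healthy) openedAt = Instant.now();
                calls = failures = 0;
            }
        } else {                                              // CLOSED: evaluate the window
            calls++;
            if (failed) failures++;
            if (calls >= windowSize) {
                if ((double) failures / calls >= failureRateThreshold) {
                    state = State.OPEN;
                    openedAt = Instant.now();
                }
                calls = failures = 0;                         // start a new window
            }
        }
    }
}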
Resilience4j configuration that actually makes sense
Resilience4j (the de facto standard for Java; Polly fills the same role for .NET, gobreaker for Go) gives you fine-grained control over the circuit breaker parameters:
resilience4j:
  circuitbreaker:
    instances:
      inventoryService:
        # How many calls to track in the sliding window
        slidingWindowSize: 20
        # Open if >= 50% of calls in the window fail
        failureRateThreshold: 50
        # Also open if >= 50% of calls are slow
        slowCallRateThreshold: 50
        # "Slow" means taking longer than this
        slowCallDurationThreshold: 2s
        # Minimum calls before the failure rate is evaluated
        minimumNumberOfCalls: 5
        # How long to stay open before allowing test calls
        waitDurationInOpenState: 30s
        # How many test calls are allowed in half-open state
        permittedNumberOfCallsInHalfOpenState: 3
        # What counts as a failure
        recordExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
          - feign.FeignException.ServiceUnavailable
The slowCallRateThreshold is frequently overlooked. A service that responds in 8 seconds instead of 200ms is functionally a failure for latency-sensitive paths. Configuring slow call detection prevents thread exhaustion from slow (but not failing) downstream services.
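If you are not using the Spring Boot starter's YAML binding, the same settings can be expressed with Resilience4j's programmatic CircuitBreakerConfig builder. A minimal sketch (the class name is illustrative; in a Spring app the registry would typically be a shared bean):

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

import java.time.Duration;

class InventoryBreakerConfig {

    static CircuitBreaker inventoryBreaker() {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            .slidingWindowSize(20)
            .failureRateThreshold(50)
            .slowCallRateThreshold(50)                          // slow calls count toward opening
            .slowCallDurationThreshold(Duration.ofSeconds(2))   // "slow" = longer than 2 seconds
            .minimumNumberOfCalls(5)
            .waitDurationInOpenState(Duration.ofSeconds(30))
            .permittedNumberOfCallsInHalfOpenState(3)
            .build();

        // One registry is normally shared across the application; calls decorated with
        // this breaker throw CallNotPermittedException while the circuit is open.
        return CircuitBreakerRegistry.of(config).circuitBreaker("inventoryService");
    }
}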
Fallback design is where the real work is
A circuit breaker without a meaningful fallback is just a fast-fail. The fallback is the design decision that determines how well your system degrades when a dependency is unavailable.
Fallback options, in rough order of desirability:
Cached response: return the last known good response for this request. Works well for data that changes slowly (product catalog, user preferences). Requires a local cache layer (Redis, Caffeine). Document the maximum staleness you're willing to accept.
@CircuitBreaker(name = "inventoryService", fallbackMethod = "inventoryFromCache")
public InventoryStatus getInventory(String itemId) {
    return inventoryClient.getStatus(itemId);
}

private InventoryStatus inventoryFromCache(String itemId, Exception ex) {
    return inventoryCache.get(itemId)
        .orElse(InventoryStatus.optimisticallyAvailable(itemId));
}
Degraded response: return a response that indicates reduced functionality. Instead of inventory status, return "availability unknown — complete your order and we'll confirm." This keeps the user flow alive while being honest about uncertainty.
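A sketch of what that can look like as an alternative fallbackMethod for the getInventory call above; InventoryStatus.unknown is a hypothetical factory method for this example:

// Degraded response: be explicit that availability is unknown rather than guessing.
private InventoryStatus inventoryDegraded(String itemId, Exception ex) {
    return InventoryStatus.unknown(itemId,
        "Availability unknown. Complete your order and we'll confirm shortly.");
}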
Fail fast with a user-friendly error: if there is no reasonable fallback, fail immediately with a clear error rather than timing out. A 503 with a "we're experiencing issues with inventory checking, please try again" message is better than a 30-second wait followed by a timeout error.
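With Spring, the fallback can translate the open circuit into that 503 directly; a sketch:

// Fail fast from the fallback: surface a clear 503 right away instead of timing out.
// ResponseStatusException (org.springframework.web.server) maps the exception to an HTTP status.
private InventoryStatus inventoryUnavailable(String itemId, Exception ex) {
    throw new ResponseStatusException(HttpStatus.SERVICE_UNAVAILABLE,
        "We're experiencing issues with inventory checking, please try again.");
}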
Default behavior: for non-critical features (recommendations, personalization, enhanced product data), return a sensible default. The homepage still loads without personalized recommendations.
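A sketch for a recommendations call, where an empty default is acceptable (the client and type names are hypothetical):

@CircuitBreaker(name = "recommendationService", fallbackMethod = "noRecommendations")
public List<Product> getRecommendations(String userId) {
    return recommendationClient.forUser(userId);
}

// Default behavior: the homepage still renders, just without the recommendations widget.
private List<Product> noRecommendations(String userId, Exception ex) {
    return List.of();
}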
Common configuration mistakes
Timeouts set too high: a 30-second timeout on an inventory check means 30 seconds of blocked threads per failed request. Set timeouts just above the 99th percentile of healthy response times, within what your SLA allows; for interactive flows that's often 2–5 seconds. The circuit breaker handles the degradation; the timeout protects thread resources.
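The timeout itself lives in the HTTP client, not the circuit breaker. A sketch assuming Spring Cloud OpenFeign (in recent releases the property prefix is spring.cloud.openfeign.client.config rather than feign.client.config; both take milliseconds, and the client name is hypothetical):

feign:
  client:
    config:
      inventoryService:
        connectTimeout: 2000   # milliseconds
        readTimeout: 3000      # well under the 10s+ defaults that exhaust thread pools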
Window too small: with the 50% threshold above, a slidingWindowSize of 5 means the circuit opens after just 3 failures. Under low traffic, you'll get false positives — temporary network blips opening the circuit unnecessarily. Set the window and minimum call count to reflect realistic traffic volumes.
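One way to make the window reflect traffic rather than a raw call count is Resilience4j's time-based window; a sketch adjusting the config from above (the numbers are illustrative):

resilience4j:
  circuitbreaker:
    instances:
      inventoryService:
        slidingWindowType: TIME_BASED
        slidingWindowSize: 60          # seconds, when TIME_BASED
        minimumNumberOfCalls: 20       # don't evaluate the failure rate on a handful of calls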
No alerting on open circuits: an open circuit breaker is a signal that a dependency is failing. If nobody is alerted when it opens, you're flying blind. Expose circuit breaker state as a metric (Resilience4j integrates with Micrometer/Prometheus natively) and alert when any circuit has been open for more than N minutes.
# Prometheus alert: circuit breaker has been open for 5 minutes
- alert: CircuitBreakerOpen
  expr: resilience4j_circuitbreaker_state{state="open"} == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Circuit breaker {{ $labels.name }} has been open for 5 minutes"
Circuit breakers are not set-and-forget. Tune the thresholds against your actual traffic patterns in staging before relying on them in production.