Circuit Breakers in Microservices: Stop Letting One Failure Break Everything

by Eric Hanson, Backend Developer at Clean Systems Consulting

The failure pattern a circuit breaker prevents

Inventory Service is responding slowly — GC pause, database lock contention, doesn't matter. Order Service calls it synchronously on every checkout. Without a circuit breaker: threads in Order Service block for the full timeout duration (say, 10 seconds), the thread pool fills up, Order Service stops accepting new requests. What started as an Inventory Service performance issue is now a complete Order Service outage. Users can't check out. Meanwhile, Inventory Service is getting hammered with retry traffic from every Order Service instance, making its own recovery harder.

The circuit breaker pattern, named by Michael Nygard in "Release It!", addresses this by tracking failure rates and cutting off calls to an unhealthy service before resources are exhausted. Like its electrical namesake, it trips when failures cross a threshold: the circuit opens and subsequent calls fail immediately (returning a fallback) instead of waiting for a timeout. After a recovery period, it lets a few test calls through; if they succeed, the circuit closes.

The three states

Understanding the state machine is a prerequisite for configuring it correctly:

Closed (normal): calls pass through. The circuit counts successes and failures. When the failure rate in a sliding window exceeds the threshold, the circuit opens.

Open (failing): calls are short-circuited immediately — they fail fast without touching the downstream service. The fallback method executes instead. The downstream service gets no traffic, which gives it time to recover.

Half-Open (testing): after the wait duration, a limited number of test calls are allowed through. If they succeed, the circuit closes. If they fail, it opens again for another wait duration.
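
To make the transitions concrete, here is a minimal sketch of the state machine in plain Java. It is an illustration, not a substitute for a library: it counts consecutive failures instead of a failure rate over a sliding window, and it allows every call through in half-open rather than a limited number.

import java.time.Duration;
import java.time.Instant;

public class SimpleCircuitBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final Duration waitDuration;

    private State state = State.CLOSED;
    private int failureCount = 0;
    private Instant openedAt;

    public SimpleCircuitBreaker(int failureThreshold, Duration waitDuration) {
        this.failureThreshold = failureThreshold;
        this.waitDuration = waitDuration;
    }

    // Ask before every downstream call. Returns false when the call
    // should be short-circuited to the fallback.
    public synchronized boolean allowRequest() {
        if (state == State.OPEN) {
            if (Instant.now().isAfter(openedAt.plus(waitDuration))) {
                state = State.HALF_OPEN; // wait elapsed: allow a test call
                return true;
            }
            return false; // still open: fail fast, no downstream traffic
        }
        return true; // CLOSED and HALF_OPEN let calls through
    }

    public synchronized void recordSuccess() {
        failureCount = 0;
        state = State.CLOSED; // a successful test call closes the circuit
    }

    public synchronized void recordFailure() {
        failureCount++;
        if (state == State.HALF_OPEN || failureCount >= failureThreshold) {
            state = State.OPEN; // stop sending traffic downstream
            openedAt = Instant.now();
        }
    }
}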

Resilience4j configuration that actually makes sense

Resilience4j (the de facto Java standard; Polly fills the same role for .NET, gobreaker for Go) gives you fine-grained control over the circuit breaker's parameters:

resilience4j:
  circuitbreaker:
    instances:
      inventoryService:
        # How many calls to track in the sliding window
        slidingWindowSize: 20
        # Open if >= 50% of calls in window fail
        failureRateThreshold: 50
        # Also open if >= 50% of calls are slow
        slowCallRateThreshold: 50
        # "Slow" means taking longer than this
        slowCallDurationThreshold: 2s
        # Minimum calls before the failure rate is evaluated
        minimumNumberOfCalls: 5
        # How long to stay open before allowing test calls
        waitDurationInOpenState: 30s
        # How many test calls in half-open state
        permittedNumberOfCallsInHalfOpenState: 3
        # What counts as a failure
        recordExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
          - feign.FeignException.ServiceUnavailable

The slowCallRateThreshold is frequently overlooked. A service that responds in 8 seconds instead of 200ms is functionally a failure for latency-sensitive paths. Configuring slow call detection prevents thread exhaustion from slow (but not failing) downstream services.
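
Outside Spring Boot, the same settings can be expressed with Resilience4j's programmatic API. A sketch mirroring the YAML above, where inventoryClient and itemId are the same hypothetical pieces used in the fallback example later in this article:

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .slidingWindowSize(20)
        .failureRateThreshold(50)   // percent
        .slowCallRateThreshold(50)  // percent
        .slowCallDurationThreshold(Duration.ofSeconds(2))
        .minimumNumberOfCalls(5)
        .waitDurationInOpenState(Duration.ofSeconds(30))
        .permittedNumberOfCallsInHalfOpenState(3)
        .recordExceptions(IOException.class, TimeoutException.class)
        .build();

CircuitBreaker breaker = CircuitBreakerRegistry.of(config)
        .circuitBreaker("inventoryService");

// Calls made through the decorated supplier are counted by the breaker;
// while the circuit is open, guarded.get() throws CallNotPermittedException.
Supplier<InventoryStatus> guarded = CircuitBreaker.decorateSupplier(
        breaker, () -> inventoryClient.getStatus(itemId));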

Fallback design is where the real work is

A circuit breaker without a meaningful fallback is just a fast-fail. The fallback is the design decision that determines how well your system degrades when a dependency is unavailable.

Fallback options, in rough order of desirability:

Cached response: return the last known good response for this request. Works well for data that changes slowly (product catalog, user preferences). Requires a local cache layer (Redis, Caffeine). Document the maximum staleness you're willing to accept.

import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;

// While the circuit is open, or when the call throws a recorded
// exception, Resilience4j invokes the fallback instead.
@CircuitBreaker(name = "inventoryService", fallbackMethod = "inventoryFromCache")
public InventoryStatus getInventory(String itemId) {
    return inventoryClient.getStatus(itemId);
}

// Same signature as the guarded method, plus a trailing exception parameter.
private InventoryStatus inventoryFromCache(String itemId, Exception ex) {
    return inventoryCache.get(itemId)
        .orElse(InventoryStatus.optimisticallyAvailable(itemId));
}

Degraded response: return a response that indicates reduced functionality. Instead of inventory status, return "availability unknown — complete your order and we'll confirm." This keeps the user flow alive while being honest about uncertainty.

Fail fast with a user-friendly error: if there is no reasonable fallback, fail immediately with a clear error rather than timing out. A 503 with a "we're experiencing issues with inventory checking, please try again" message is better than a 30-second wait followed by a timeout error.

Default behavior: for non-critical features (recommendations, personalization, enhanced product data), return a sensible default. The homepage still loads without personalized recommendations.
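
As a sketch of that last option (RecommendationsClient and Recommendation are hypothetical stand-ins for whatever your service uses):

import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;

import java.util.List;

@CircuitBreaker(name = "recommendationService", fallbackMethod = "noRecommendations")
public List<Recommendation> getRecommendations(String userId) {
    return recommendationsClient.forUser(userId);
}

// Non-critical feature: the page still renders without it.
private List<Recommendation> noRecommendations(String userId, Exception ex) {
    return List.of();
}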

Common configuration mistakes

Timeouts set too high: a 30-second timeout on an inventory check means 30 seconds of blocked threads per failed request. Set timeouts at the 99th percentile of acceptable response time for your SLA — often 2–5 seconds for interactive flows. The circuit breaker handles the degradation; the timeout protects thread resources.
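
In Resilience4j the timeout itself comes from the separate TimeLimiter module rather than the circuit breaker. A sketch, assuming the same instance name as above:

resilience4j:
  timelimiter:
    instances:
      inventoryService:
        # Fail the call once it exceeds this duration
        timeoutDuration: 3s
        # Cancel the running future so the thread is released
        cancelRunningFuture: true

Note that the TimeLimiter decorates asynchronous calls (CompletableFuture); for blocking HTTP clients, set the client's own connect and read timeouts instead.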

Window too small: a slidingWindowSize of 5 means the circuit opens on 3 failures (3 of 5 is 60%, past a 50% threshold). Under low traffic, you'll get false positives — temporary network blips opening the circuit unnecessarily. Set the window and minimum call count to reflect realistic traffic volumes.

No alerting on open circuits: an open circuit breaker is a signal that a dependency is failing. If nobody is alerted when it opens, you're flying blind. Expose circuit breaker state as a metric (Resilience4j integrates with Micrometer/Prometheus natively) and alert when any circuit has been open for more than N minutes.

# Prometheus alert: circuit breaker has been open for 5 minutes
- alert: CircuitBreakerOpen
  expr: resilience4j_circuitbreaker_state{state="open"} == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Circuit breaker {{ $labels.name }} has been open for 5 minutes"

Circuit breakers are not set-and-forget. Tune the thresholds against your actual traffic patterns in staging before relying on them in production.
