Spring Boot Microservices — Service-to-Service Communication, Circuit Breakers, and Resilience Patterns

by Eric Hanson, Backend Developer at Clean Systems Consulting

The failure modes in distributed systems

A monolith fails as a unit — the process is up or it's down. In a distributed system, individual components fail independently and in partial ways: a service responds slowly, times out intermittently, or returns errors for a subset of requests. These partial failures are harder to detect and more destructive than total failures.

Three patterns dominate the failure landscape:

Cascading failure. Service A calls Service B, which calls Service C. Service C slows down. Service B's threads fill up waiting on C. Service A's threads fill up waiting on B. The failure propagates up the call chain. Services that had nothing to do with the original slowdown are now failing.

Thread pool exhaustion. A service makes synchronous HTTP calls to a downstream service. If the downstream service is slow, each call holds a thread until it completes or times out. With a thread pool of 200 threads and downstream calls taking 10 seconds, 200 concurrent slow requests exhaust the pool. New requests queue; the service appears hung.
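The pool-exhaustion arithmetic is Little's law — concurrency = throughput × latency — so a pool's sustainable throughput is its size divided by downstream latency. A quick sketch with the numbers above (class and method names are illustrative):

```java
public class PoolSaturation {

    // Little's law: concurrency = throughput × latency, so a pool of N threads
    // making calls that take L seconds sustains at most N / L requests per second.
    static double maxThroughputPerSecond(int poolSize, double latencySeconds) {
        return poolSize / latencySeconds;
    }

    public static void main(String[] args) {
        // 200 threads, 10-second downstream calls: the pool saturates at 20 req/s.
        // Any sustained load above that queues requests and the service appears hung.
        System.out.println(maxThroughputPerSecond(200, 10.0) + " req/s");
    }
}
```

The same math explains why timeouts help: cutting the downstream latency bound from 10 seconds to 2 raises the saturation point of the same pool fivefold.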

Retry storms. A downstream service returns errors. Callers retry. The retry traffic adds load to an already struggling service, making recovery slower or impossible. Uncoordinated retries from multiple callers amplify the problem.

Resilience patterns address each of these: timeouts limit thread blocking, circuit breakers stop calls to failing services, and retry with backoff prevents retry storms.

WebClient — the non-blocking HTTP client

RestTemplate is the legacy synchronous HTTP client. WebClient from Spring WebFlux is the current standard — it supports both synchronous and reactive usage and integrates with Resilience4j patterns cleanly:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-webflux</artifactId>
</dependency>
@Configuration
public class WebClientConfig {

    private static final Logger log = LoggerFactory.getLogger(WebClientConfig.class);

    @Bean
    public WebClient orderServiceClient(
            @Value("${services.order-service.base-url}") String baseUrl) {
        return WebClient.builder()
            .baseUrl(baseUrl)
            .defaultHeader(HttpHeaders.CONTENT_TYPE, MediaType.APPLICATION_JSON_VALUE)
            .codecs(config -> config.defaultCodecs()
                .maxInMemorySize(1024 * 1024))  // 1MB response buffer
            .filter(loggingFilter())
            .build();
    }

    private ExchangeFilterFunction loggingFilter() {
        return ExchangeFilterFunction.ofRequestProcessor(request -> {
            log.debug("Request: {} {}", request.method(), request.url());
            return Mono.just(request);
        });
    }
}

Using it synchronously (for Spring MVC contexts):

@Service
public class OrderClient {

    private final WebClient client;

    public OrderClient(WebClient orderServiceClient) {
        this.client = orderServiceClient;
    }

    public Order getOrder(String orderId) {
        return client.get()
            .uri("/orders/{id}", orderId)
            .retrieve()
            .onStatus(HttpStatusCode::is4xxClientError,
                response -> response.bodyToMono(ErrorResponse.class)
                    .flatMap(error -> Mono.error(new OrderClientException(error.message()))))
            .onStatus(HttpStatusCode::is5xxServerError,
                response -> Mono.error(new ServiceUnavailableException("order-service")))
            .bodyToMono(Order.class)
            .block(Duration.ofSeconds(5));  // convert to synchronous with timeout
    }
}

retrieve() + onStatus() handles HTTP error status codes explicitly. Without onStatus(), retrieve() maps any 4xx/5xx response to a generic WebClientResponseException — onStatus() lets you translate those statuses into meaningful domain exceptions instead.

.block(Duration.ofSeconds(5)) converts the reactive call to synchronous with a timeout. If the response doesn't arrive within 5 seconds, block() throws an IllegalStateException rather than hanging. This is the correct pattern for Spring MVC contexts: synchronous behavior, but never blocking without a bound.

Resilience4j — the modern resilience library

Resilience4j replaces Netflix Hystrix (maintenance-mode since 2018). Spring Boot integrates it via the resilience4j-spring-boot3 starter:

<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-spring-boot3</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-aop</artifactId>
</dependency>

AOP is required — Resilience4j's Spring annotations use AOP proxies.
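A corollary of the proxy model is the self-invocation gotcha: a resilience annotation only fires when the call enters the bean through its proxy, so a method calling an annotated method on `this` bypasses the decorator. A plain-JDK sketch of why (`Proxy` stands in for the Spring AOP proxy; all names are illustrative):

```java
import java.lang.reflect.Proxy;
import java.util.concurrent.atomic.AtomicInteger;

public class SelfInvocationDemo {

    interface OrderApi {
        String getOrder(String id);
        String getOrderWithDetails(String id);
    }

    static class OrderService implements OrderApi {
        public String getOrder(String id) { return "order-" + id; }
        public String getOrderWithDetails(String id) {
            // Self-invocation: this call goes straight to the target instance,
            // bypassing the proxy — no decorator runs for the nested call.
            return getOrder(id) + "+details";
        }
    }

    // Counts how many method executions pass through the proxy (where a
    // @CircuitBreaker or @Retry decorator would run in Spring).
    static int interceptedCalls() {
        AtomicInteger intercepted = new AtomicInteger();
        OrderApi target = new OrderService();
        OrderApi proxy = (OrderApi) Proxy.newProxyInstance(
            OrderApi.class.getClassLoader(),
            new Class<?>[]{OrderApi.class},
            (p, method, args) -> {
                intercepted.incrementAndGet();   // decorator logic would go here
                return method.invoke(target, args);
            });

        proxy.getOrder("1");             // intercepted
        proxy.getOrderWithDetails("2");  // intercepted once; the nested getOrder() is not
        return intercepted.get();
    }

    public static void main(String[] args) {
        System.out.println(interceptedCalls() + " of 3 method executions were intercepted");
    }
}
```

The usual fixes are the same as for @Transactional: call through the injected proxy, or move the annotated method into a separate bean.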

Circuit breaker — stopping calls to failing services

A circuit breaker tracks the success/failure rate of calls to a downstream service. When the failure rate exceeds a threshold, the circuit "opens" — subsequent calls fail immediately without attempting the downstream call. After a wait period, the circuit enters "half-open" state: a limited number of test calls are allowed. If they succeed, the circuit closes; if they fail, it reopens.
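The state machine can be sketched in plain Java — a toy model of the transitions just described, not Resilience4j's implementation (class and parameter names are illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ToyCircuitBreaker {

    public enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize, minCalls, halfOpenCalls;
    private final double failureRateThreshold;   // percent, e.g. 50.0
    private final long openWaitMillis;

    private final Deque<Boolean> window = new ArrayDeque<>();  // true = failed call
    private State state = State.CLOSED;
    private long openedAt;
    private int halfOpenRemaining;

    public ToyCircuitBreaker(int windowSize, int minCalls, double failureRateThreshold,
                             long openWaitMillis, int halfOpenCalls) {
        this.windowSize = windowSize;
        this.minCalls = minCalls;
        this.failureRateThreshold = failureRateThreshold;
        this.openWaitMillis = openWaitMillis;
        this.halfOpenCalls = halfOpenCalls;
    }

    /** Returns false when the call must fail fast without touching the downstream. */
    public boolean tryAcquire(long nowMillis) {
        if (state == State.OPEN) {
            if (nowMillis - openedAt < openWaitMillis) {
                return false;                       // still open: fail fast
            }
            state = State.HALF_OPEN;                // wait elapsed: allow trial calls
            halfOpenRemaining = halfOpenCalls;
        }
        if (state == State.HALF_OPEN) {
            if (halfOpenRemaining == 0) return false;
            halfOpenRemaining--;
        }
        return true;
    }

    /** Record the outcome of a permitted call. */
    public void record(boolean failed, long nowMillis) {
        if (state == State.HALF_OPEN) {
            if (failed) {                           // any trial failure reopens
                open(nowMillis);
            } else if (halfOpenRemaining == 0) {    // all trial calls succeeded
                state = State.CLOSED;
                window.clear();
            }
            return;
        }
        window.addLast(failed);
        if (window.size() > windowSize) window.removeFirst();
        if (window.size() >= minCalls && failureRatePercent() >= failureRateThreshold) {
            open(nowMillis);
        }
    }

    private void open(long nowMillis) {
        state = State.OPEN;
        openedAt = nowMillis;
    }

    private double failureRatePercent() {
        long failures = window.stream().filter(f -> f).count();
        return 100.0 * failures / window.size();
    }

    public State state() { return state; }

    public static void main(String[] args) {
        // Window 10, minimum 5 calls, 50% threshold, 30s open wait, 3 trial calls
        ToyCircuitBreaker cb = new ToyCircuitBreaker(10, 5, 50.0, 30_000, 3);
        for (int i = 0; i < 5; i++) {
            cb.tryAcquire(0);
            cb.record(i % 2 == 0, 0);               // 3 failures out of 5 = 60%
        }
        System.out.println(cb.state());             // OPEN — threshold exceeded
        System.out.println(cb.tryAcquire(10_000));  // false — still inside the wait window
        System.out.println(cb.tryAcquire(30_000));  // true — half-open trial call permitted
    }
}
```

The real thing adds slow-call tracking, time-based windows, and thread safety, but the CLOSED → OPEN → HALF_OPEN cycle is exactly this.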

resilience4j:
  circuitbreaker:
    instances:
      order-service:
        sliding-window-size: 10          # track last 10 calls
        minimum-number-of-calls: 5       # need 5 calls before evaluating
        failure-rate-threshold: 50       # open when 50%+ calls fail
        wait-duration-in-open-state: 30s # wait 30s before trying again
        permitted-number-of-calls-in-half-open-state: 3
        slow-call-duration-threshold: 2s # calls > 2s count as slow
        slow-call-rate-threshold: 80     # open when 80%+ calls are slow
@Service
public class OrderClient {

    @CircuitBreaker(name = "order-service", fallbackMethod = "getOrderFallback")
    public Order getOrder(String orderId) {
        return client.get()
            .uri("/orders/{id}", orderId)
            .retrieve()
            .bodyToMono(Order.class)
            .block(Duration.ofSeconds(5));
    }

    private Order getOrderFallback(String orderId, CallNotPermittedException ex) {
        // Circuit is open — return cached or degraded response
        log.warn("Circuit open for order-service, using fallback for order {}", orderId);
        return orderCache.getIfPresent(orderId)
            .orElseThrow(() -> new ServiceUnavailableException("order-service unavailable"));
    }

    private Order getOrderFallback(String orderId, Exception ex) {
        // General failure fallback
        return Order.unavailable(orderId);
    }
}

Fallback methods must have the same return type as the primary method, plus an exception parameter as the last argument. Resilience4j selects the most specific fallback based on exception type — CallNotPermittedException for open circuit, Exception as the catch-all.

Monitoring circuit breaker state via Actuator:

GET /actuator/health
{
  "components": {
    "circuitBreakers": {
      "details": {
        "order-service": {
          "status": "UP",
          "details": {
            "state": "CLOSED",
            "failureRate": "20.0%",
            "slowCallRate": "0.0%"
          }
        }
      }
    }
  }
}

Alert on circuit breaker state changes — CLOSED to OPEN is a signal that a downstream service is degraded.
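One caveat: circuit breakers don't appear in the health endpoint out of the box. In the resilience4j-spring-boot3 integration the health indicator must be registered per instance and the Spring Boot health contributor enabled — property names as documented for that starter; verify against your version:

```yaml
resilience4j:
  circuitbreaker:
    instances:
      order-service:
        register-health-indicator: true

management:
  health:
    circuitbreakers:
      enabled: true
```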

Retry — with backoff and jitter

Retries handle transient failures — network glitches, brief service restarts, rate limiting. Without controls, retries amplify load on a struggling service:

resilience4j:
  retry:
    instances:
      order-service:
        max-attempts: 3
        wait-duration: 500ms
        enable-exponential-backoff: true
        exponential-backoff-multiplier: 2
        exponential-max-wait-duration: 10s
        retry-exceptions:
          - java.net.ConnectException
          - java.net.SocketTimeoutException
          - org.springframework.web.reactive.function.client.WebClientResponseException$ServiceUnavailable
        ignore-exceptions:
          - com.example.OrderClientException  # 4xx errors — don't retry
@Retry(name = "order-service", fallbackMethod = "getOrderFallback")
@CircuitBreaker(name = "order-service", fallbackMethod = "getOrderFallback")
public Order getOrder(String orderId) {
    return client.get()
        .uri("/orders/{id}", orderId)
        .retrieve()
        .bodyToMono(Order.class)
        .block(Duration.ofSeconds(5));
}

@Retry and @CircuitBreaker compose. In Resilience4j's default aspect order, Retry is the outermost decorator, so every attempt — the original call and each retry — passes through the circuit breaker. If all three attempts fail, the circuit breaker records three failures, not one; size the sliding window and failure-rate threshold with that in mind.

Jitter. Exponential backoff without jitter causes synchronized retries when multiple callers hit the same service failure simultaneously — they all back off to the same interval and retry together. Add jitter:

resilience4j:
  retry:
    instances:
      order-service:
        enable-exponential-backoff: true
        exponential-backoff-multiplier: 2
        randomized-wait-factor: 0.5  # ±50% jitter on wait duration

With randomized-wait-factor: 0.5 and a 500ms base, retries happen between 250ms and 750ms — staggered across callers.
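The interaction of exponential backoff and jitter is easy to see with plain arithmetic — a stdlib sketch that mirrors (but is not) Resilience4j's interval function:

```java
import java.util.concurrent.ThreadLocalRandom;

public class BackoffJitter {

    // Exponential backoff with randomized jitter: the attempt's base interval grows
    // by `multiplier` each retry, then the wait is drawn uniformly from
    // [interval - factor*interval, interval + factor*interval).
    static long waitMillis(long baseMillis, double multiplier, int attempt, double factor) {
        double interval = baseMillis * Math.pow(multiplier, attempt - 1);
        double delta = interval * factor;
        double low = interval - delta;
        return (long) (low + ThreadLocalRandom.current().nextDouble() * 2 * delta);
    }

    public static void main(String[] args) {
        // base 500ms, multiplier 2, factor 0.5:
        // attempt 1 in [250, 750)ms, attempt 2 in [500, 1500)ms, attempt 3 in [1000, 3000)ms
        for (int attempt = 1; attempt <= 3; attempt++) {
            System.out.printf("attempt %d: wait %d ms%n",
                attempt, waitMillis(500, 2.0, attempt, 0.5));
        }
    }
}
```

Two callers that fail at the same instant now retry at different times, so the downstream sees a smeared trickle instead of a synchronized wave.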

Rate limiter — protecting downstream services

A rate limiter prevents sending more requests than a downstream service can handle:

resilience4j:
  ratelimiter:
    instances:
      payment-service:
        limit-for-period: 100          # max 100 calls per period
        limit-refresh-period: 1s       # period resets every second
        timeout-duration: 100ms        # wait up to 100ms to acquire a permit
@RateLimiter(name = "payment-service", fallbackMethod = "chargeRateLimitFallback")
public PaymentResult charge(PaymentRequest request) {
    return paymentClient.charge(request);
}

private PaymentResult chargeRateLimitFallback(PaymentRequest request,
        RequestNotPermitted ex) {
    throw new RateLimitExceededException("Payment service rate limit exceeded");
}

The rate limiter is useful when a downstream service has documented rate limits (payment processors, email services, third-party APIs). It prevents your service from exceeding those limits and triggering 429 responses.
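The limit-for-period / limit-refresh-period semantics amount to a fixed-window permit counter. A toy stdlib model (not Resilience4j's actual limiter, and without the timeout-duration wait for a permit):

```java
public class ToyRateLimiter {

    private final int limitForPeriod;        // permits per window
    private final long refreshPeriodMillis;  // window length

    private long windowStart;
    private int used;

    public ToyRateLimiter(int limitForPeriod, long refreshPeriodMillis) {
        this.limitForPeriod = limitForPeriod;
        this.refreshPeriodMillis = refreshPeriodMillis;
    }

    /** Returns true if a permit is available in the current window. */
    public synchronized boolean tryAcquire(long nowMillis) {
        if (nowMillis - windowStart >= refreshPeriodMillis) {
            windowStart = nowMillis;  // period elapsed: refresh all permits
            used = 0;
        }
        if (used < limitForPeriod) {
            used++;
            return true;
        }
        return false;                 // over the limit — caller falls back or fails
    }

    public static void main(String[] args) {
        ToyRateLimiter limiter = new ToyRateLimiter(100, 1_000);
        int granted = 0;
        for (int i = 0; i < 150; i++) {
            if (limiter.tryAcquire(0)) granted++;   // all in the same window
        }
        System.out.println(granted + " of 150 calls permitted");  // 100 of 150
        System.out.println(limiter.tryAcquire(1_000));            // true — new window
    }
}
```

The timeout-duration setting in the real limiter adds one refinement: instead of failing immediately, a caller may wait briefly for the next window's permits.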

Bulkhead — isolating failures

A bulkhead limits the concurrent calls to a downstream service, preventing one slow downstream from consuming all available threads:

resilience4j:
  bulkhead:
    instances:
      order-service:
        max-concurrent-calls: 20      # max 20 concurrent calls
        max-wait-duration: 100ms      # wait up to 100ms if limit reached
@Bulkhead(name = "order-service", type = Bulkhead.Type.SEMAPHORE)
public Order getOrder(String orderId) {
    return client.get()
        .uri("/orders/{id}", orderId)
        .retrieve()
        .bodyToMono(Order.class)
        .block(Duration.ofSeconds(5));
}

The semaphore bulkhead limits concurrent calls using a counting semaphore. THREAD_POOL bulkhead uses a dedicated thread pool per downstream service — stronger isolation but higher thread overhead.
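The thread-pool variant is configured under a separate key, and the annotated method must return a CompletionStage so the call can run on the dedicated pool — property names as documented for the Spring Boot starter; check your version:

```yaml
resilience4j:
  thread-pool-bulkhead:
    instances:
      order-service:
        core-thread-pool-size: 10
        max-thread-pool-size: 20   # dedicated threads for this downstream only
        queue-capacity: 50         # calls queued when all threads are busy
```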

Combining patterns — the right order

Multiple Resilience4j annotations compose. The execution order matters:

@Retry(name = "order-service")        // 1. outermost — retries the whole thing
@CircuitBreaker(name = "order-service")  // 2. checks circuit state before attempt
@Bulkhead(name = "order-service")     // 3. acquires concurrent call permit
@TimeLimiter(name = "order-service")  // 4. applies timeout to the call
public CompletableFuture<Order> getOrder(String orderId) {
    return CompletableFuture.supplyAsync(() ->
        client.get()
            .uri("/orders/{id}", orderId)
            .retrieve()
            .bodyToMono(Order.class)
            .block()
    );
}

The standard order: Retry → CircuitBreaker → Bulkhead → TimeLimiter → actual call. The retry wraps everything — if the circuit is open or the bulkhead is full, the retry sees a CallNotPermittedException or BulkheadFullException and may retry based on configuration. Usually you want to configure retry to NOT retry these — only retry on transient network errors:

resilience4j:
  retry:
    instances:
      order-service:
        ignore-exceptions:
          - io.github.resilience4j.circuitbreaker.CallNotPermittedException
          - io.github.resilience4j.bulkhead.BulkheadFullException

The timeout that actually matters

A timeout without a fallback just means slower failure. The meaningful timeout is the one that protects your thread pool from being exhausted:

public Order getOrder(String orderId) {
    try {
        return client.get()
            .uri("/orders/{id}", orderId)
            .retrieve()
            .bodyToMono(Order.class)
            .timeout(Duration.ofSeconds(3))  // WebClient-level (reactive) timeout
            .block(Duration.ofSeconds(4));   // block() timeout as safety net
    } catch (RuntimeException ex) {
        // Reactor wraps the checked TimeoutException from .timeout() in a
        // RuntimeException, and the block() timeout surfaces as
        // IllegalStateException — so catch broadly here and fall back.
        log.warn("Timeout or failure calling order-service for order {}", orderId, ex);
        return orderCache.getIfPresent(orderId)
            .orElse(Order.unavailable(orderId));
    }
}

Two timeout layers: .timeout() on the reactive pipeline cancels the subscription after 3 seconds; .block() with a 4-second timeout prevents the thread from blocking indefinitely if the reactive timeout fires but doesn't propagate cleanly. The fallback (cache or degraded response) determines whether the caller experiences the failure.

Health and observability for resilience patterns

Resilience4j exports metrics to Micrometer automatically:

resilience4j.circuitbreaker.calls{name="order-service", kind="successful"}
resilience4j.circuitbreaker.calls{name="order-service", kind="failed"}
resilience4j.circuitbreaker.not.permitted.calls{name="order-service", kind="not_permitted"}
resilience4j.circuitbreaker.state{name="order-service", state="open"}  # one gauge per state: 1 when active, 0 otherwise
resilience4j.retry.calls{name="order-service", kind="successful_without_retry"}
resilience4j.retry.calls{name="order-service", kind="successful_with_retry"}
resilience4j.retry.calls{name="order-service", kind="failed_with_retry"}

Alert on:

  • the state="open" gauge at 1 for more than 60 seconds — the downstream service is persistently degraded
  • a rising kind="failed_with_retry" rate — retries are exhausting without success, indicating a persistent downstream issue
  • a rising kind="not_permitted" rate — callers are hitting open circuits, indicating a degraded user experience

The combination of circuit breakers, retries, bulkheads, and monitoring creates a system that degrades gracefully rather than cascading — downstream failures are contained, retried appropriately, and surfaced through metrics before users experience the full impact.
