Spring Boot Microservices — Service-to-Service Communication, Circuit Breakers, and Resilience Patterns
by Eric Hanson, Backend Developer at Clean Systems Consulting
The failure modes in distributed systems
A monolith fails as a unit — the process is up or it's down. In a distributed system, individual components fail independently and in partial ways: a service responds slowly, times out intermittently, or returns errors for a subset of requests. These partial failures are harder to detect and more destructive than total failures.
Three patterns dominate the failure landscape:
Cascading failure. Service A calls Service B, which calls Service C. Service C slows down. Service B's threads fill up waiting on C. Service A's threads fill up waiting on B. The failure propagates up the call chain. Services that had nothing to do with the original slowdown are now failing.
Thread pool exhaustion. A service makes synchronous HTTP calls to a downstream service. If the downstream service is slow, each call holds a thread until it completes or times out. With a thread pool of 200 threads and downstream calls taking 10 seconds, 200 concurrent slow requests exhaust the pool. New requests queue; the service appears hung.
Retry storms. A downstream service returns errors. Callers retry. The retry traffic adds load to an already struggling service, making recovery slower or impossible. Uncoordinated retries from multiple callers amplify the problem.
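The thread-pool arithmetic above follows from Little's law: concurrent calls ≈ arrival rate × latency. A quick sketch (PoolMath is a hypothetical helper, not part of any framework):

```java
// Back-of-envelope check of the thread pool exhaustion scenario using
// Little's law: concurrent calls ≈ arrival rate × latency.
public class PoolMath {

    // Average number of in-flight calls for a given request rate and latency.
    static double concurrentCalls(double requestsPerSecond, double latencySeconds) {
        return requestsPerSecond * latencySeconds;
    }

    public static void main(String[] args) {
        // 20 requests/second against a downstream taking 10 seconds per call:
        System.out.println(concurrentCalls(20, 10)); // prints 200.0: the entire 200-thread pool
    }
}
```

Even a modest 20 requests/second exhausts a 200-thread pool when downstream latency hits 10 seconds, which is why timeouts come before every other pattern.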
Resilience patterns address each of these: timeouts limit thread blocking, circuit breakers stop calls to failing services, and retry with backoff prevents retry storms.
WebClient — the non-blocking HTTP client
RestTemplate is the legacy synchronous HTTP client and has been in maintenance mode since Spring 5. WebClient from Spring WebFlux is the current standard: it supports both synchronous and reactive usage and integrates with Resilience4j patterns cleanly:
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-webflux</artifactId>
</dependency>
@Configuration
public class WebClientConfig {
@Bean
public WebClient orderServiceClient(
@Value("${services.order-service.base-url}") String baseUrl) {
return WebClient.builder()
.baseUrl(baseUrl)
.defaultHeader(HttpHeaders.CONTENT_TYPE, MediaType.APPLICATION_JSON_VALUE)
.codecs(config -> config.defaultCodecs()
.maxInMemorySize(1024 * 1024)) // 1MB response buffer
.filter(loggingFilter())
.build();
}
private ExchangeFilterFunction loggingFilter() {
return ExchangeFilterFunction.ofRequestProcessor(request -> {
log.debug("Request: {} {}", request.method(), request.url());
return Mono.just(request);
});
}
}
Using it synchronously (for Spring MVC contexts):
@Service
public class OrderClient {
private final WebClient client;
public Order getOrder(String orderId) {
return client.get()
.uri("/orders/{id}", orderId)
.retrieve()
.onStatus(HttpStatusCode::is4xxClientError,
response -> response.bodyToMono(ErrorResponse.class)
.flatMap(error -> Mono.error(new OrderClientException(error.message()))))
.onStatus(HttpStatusCode::is5xxServerError,
response -> Mono.error(new ServiceUnavailableException("order-service")))
.bodyToMono(Order.class)
.block(Duration.ofSeconds(5)); // convert to synchronous with timeout
}
}
retrieve() + onStatus() handles HTTP error status codes explicitly. Without custom onStatus() handlers, retrieve() maps any 4xx or 5xx response to a generic WebClientResponseException; onStatus() lets you translate specific status ranges into domain exceptions the caller can handle meaningfully.
.block(Duration.ofSeconds(5)) converts the reactive call to synchronous with a timeout. If the response doesn't arrive within 5 seconds, Reactor throws an IllegalStateException instead of blocking the thread indefinitely. This is the correct pattern for Spring MVC contexts where you want synchronous behavior without an unbounded block.
Resilience4j — the modern resilience library
Resilience4j replaces Netflix Hystrix (maintenance-mode since 2018). Spring Boot integrates it via the resilience4j-spring-boot3 starter:
<dependency>
<groupId>io.github.resilience4j</groupId>
<artifactId>resilience4j-spring-boot3</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-aop</artifactId>
</dependency>
AOP is required — Resilience4j's Spring annotations use AOP proxies.
Circuit breaker — stopping calls to failing services
A circuit breaker tracks the success/failure rate of calls to a downstream service. When the failure rate exceeds a threshold, the circuit "opens" — subsequent calls fail immediately without attempting the downstream call. After a wait period, the circuit enters "half-open" state: a limited number of test calls are allowed. If they succeed, the circuit closes; if they fail, it reopens.
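The state machine can be sketched in plain Java. This is a toy illustration of the description above, not Resilience4j's implementation (which adds sliding windows, metrics, events, and a configurable number of half-open probes; here a single probe decides):

```java
import java.util.function.Supplier;

// Toy count-based circuit breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED/OPEN.
class ToyCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int calls = 0, failures = 0;
    private long openedAtMillis = 0;

    private final int windowSize;              // evaluate failure rate over this many calls
    private final double failureRateThreshold; // open at or above this rate (0.0 to 1.0)
    private final long waitDurationMillis;     // how long to stay OPEN before probing

    ToyCircuitBreaker(int windowSize, double failureRateThreshold, long waitDurationMillis) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.waitDurationMillis = waitDurationMillis;
    }

    synchronized <T> T call(Supplier<T> supplier) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAtMillis < waitDurationMillis) {
                throw new IllegalStateException("circuit open"); // fail fast, no downstream call
            }
            state = State.HALF_OPEN; // wait elapsed: allow a probe call through
        }
        try {
            T result = supplier.get();
            onResult(true);
            return result;
        } catch (RuntimeException e) {
            onResult(false);
            throw e;
        }
    }

    private void onResult(boolean success) {
        if (state == State.HALF_OPEN) { // the probe's outcome decides the next state
            state = success ? State.CLOSED : State.OPEN;
            if (!success) openedAtMillis = System.currentTimeMillis();
            calls = 0; failures = 0;
            return;
        }
        calls++;
        if (!success) failures++;
        if (calls >= windowSize) { // window full: evaluate the failure rate
            if ((double) failures / calls >= failureRateThreshold) {
                state = State.OPEN;
                openedAtMillis = System.currentTimeMillis();
            }
            calls = 0; failures = 0;
        }
    }

    synchronized State state() { return state; }
}
```

Four failures in a window of four with a 50% threshold trips the breaker; subsequent calls fail immediately without touching the downstream service.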
resilience4j:
circuitbreaker:
instances:
order-service:
sliding-window-size: 10 # track last 10 calls
minimum-number-of-calls: 5 # need 5 calls before evaluating
failure-rate-threshold: 50 # open when 50%+ calls fail
wait-duration-in-open-state: 30s # wait 30s before trying again
permitted-number-of-calls-in-half-open-state: 3
slow-call-duration-threshold: 2s # calls > 2s count as slow
slow-call-rate-threshold: 80 # open when 80%+ calls are slow
@Service
public class OrderClient {
@CircuitBreaker(name = "order-service", fallbackMethod = "getOrderFallback")
public Order getOrder(String orderId) {
return client.get()
.uri("/orders/{id}", orderId)
.retrieve()
.bodyToMono(Order.class)
.block(Duration.ofSeconds(5));
}
private Order getOrderFallback(String orderId, CallNotPermittedException ex) {
// Circuit is open — return cached or degraded response
log.warn("Circuit open for order-service, using fallback for order {}", orderId);
return orderCache.getIfPresent(orderId)
.orElseThrow(() -> new ServiceUnavailableException("order-service unavailable"));
}
private Order getOrderFallback(String orderId, Exception ex) {
// General failure fallback
return Order.unavailable(orderId);
}
}
Fallback methods must have the same return type as the primary method, plus an exception parameter as the last argument. Resilience4j selects the most specific fallback based on exception type — CallNotPermittedException for open circuit, Exception as the catch-all.
Monitoring circuit breaker state via Actuator:
GET /actuator/health
{
"components": {
"circuitBreakers": {
"details": {
"order-service": {
"status": "UP",
"details": {
"state": "CLOSED",
"failureRate": "20.0%",
"slowCallRate": "0.0%"
}
}
}
}
}
}
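One note: circuit breakers only appear in the health endpoint when the health indicator is registered, which is opt-in. The properties below are what the Resilience4j Spring Boot documentation describes for the resilience4j-spring-boot3 starter; verify against your version:

```yaml
management:
  health:
    circuitbreakers:
      enabled: true

resilience4j:
  circuitbreaker:
    instances:
      order-service:
        register-health-indicator: true
```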
Alert on circuit breaker state changes — CLOSED to OPEN is a signal that a downstream service is degraded.
Retry — with backoff and jitter
Retries handle transient failures — network glitches, brief service restarts, rate limiting. Without controls, retries amplify load on a struggling service:
resilience4j:
retry:
instances:
order-service:
max-attempts: 3
wait-duration: 500ms
enable-exponential-backoff: true
exponential-backoff-multiplier: 2
exponential-max-wait-duration: 10s
retry-exceptions:
- java.net.ConnectException
- java.net.SocketTimeoutException
- org.springframework.web.reactive.function.client.WebClientResponseException$ServiceUnavailable
ignore-exceptions:
- com.example.OrderClientException # 4xx errors — don't retry
@Retry(name = "order-service", fallbackMethod = "getOrderFallback")
@CircuitBreaker(name = "order-service", fallbackMethod = "getOrderFallback")
public Order getOrder(String orderId) {
return client.get()
.uri("/orders/{id}", orderId)
.retrieve()
.bodyToMono(Order.class)
.block(Duration.ofSeconds(5));
}
@Retry and @CircuitBreaker compose, with the retry as the outermost aspect: every attempt passes through the circuit breaker, so each failed attempt is recorded individually. If all three attempts fail, the circuit breaker counts three failures toward the failure rate, not one.
Jitter. Exponential backoff without jitter causes synchronized retries when multiple callers hit the same service failure simultaneously — they all back off to the same interval and retry together. Add jitter:
resilience4j:
retry:
instances:
order-service:
enable-exponential-backoff: true
exponential-backoff-multiplier: 2
randomized-wait-factor: 0.5 # ±50% jitter on wait duration
With randomized-wait-factor: 0.5 and a 500ms base, retries happen between 250ms and 750ms — staggered across callers.
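The jittered-interval arithmetic can be checked in a few lines of plain Java. BackoffDemo is a hypothetical helper mirroring the description above (wait grows by the multiplier each attempt, then is drawn uniformly from [wait·(1−f), wait·(1+f)]), not Resilience4j's internal IntervalFunction:

```java
import java.util.Random;

// Sketch of exponential backoff with jitter: base 500ms, multiplier 2, factor 0.5.
public class BackoffDemo {

    static long jitteredWaitMillis(long baseMillis, double multiplier,
                                   double factor, int attempt, Random rng) {
        double wait = baseMillis * Math.pow(multiplier, attempt - 1); // exponential growth
        double lo = wait * (1 - factor);
        // uniform draw from [wait*(1-factor), wait*(1+factor)]
        return (long) (lo + rng.nextDouble() * wait * 2 * factor);
    }

    public static void main(String[] args) {
        Random rng = new Random();
        for (int attempt = 1; attempt <= 3; attempt++) {
            System.out.println("attempt " + attempt + ": wait "
                + jitteredWaitMillis(500, 2.0, 0.5, attempt, rng) + "ms");
        }
    }
}
```

Attempt 1 falls in [250ms, 750ms], attempt 2 in [500ms, 1500ms], attempt 3 in [1s, 3s]; two callers that failed at the same instant almost never retry at the same instant.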
Rate limiter — protecting downstream services
A rate limiter prevents sending more requests than a downstream service can handle:
resilience4j:
ratelimiter:
instances:
payment-service:
limit-for-period: 100 # max 100 calls per period
limit-refresh-period: 1s # period resets every second
timeout-duration: 100ms # wait up to 100ms to acquire a permit
@RateLimiter(name = "payment-service", fallbackMethod = "chargeRateLimitFallback")
public PaymentResult charge(PaymentRequest request) {
return paymentClient.charge(request);
}
private PaymentResult chargeRateLimitFallback(PaymentRequest request,
RequestNotPermitted ex) {
throw new RateLimitExceededException("Payment service rate limit exceeded");
}
The rate limiter is useful when a downstream service has documented rate limits (payment processors, email services, third-party APIs). It prevents your service from exceeding those limits and triggering 429 responses.
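The limit-for-period / limit-refresh-period semantics amount to a fixed-window counter. A toy sketch in plain Java (not Resilience4j's implementation, which is more sophisticated and can also make callers wait up to timeout-duration for the next window):

```java
// Toy fixed-window rate limiter: permits reset every refresh period.
class ToyRateLimiter {
    private final int limitForPeriod;
    private final long refreshPeriodNanos;
    private long windowStart = System.nanoTime();
    private int usedInWindow = 0;

    ToyRateLimiter(int limitForPeriod, long refreshPeriodNanos) {
        this.limitForPeriod = limitForPeriod;
        this.refreshPeriodNanos = refreshPeriodNanos;
    }

    synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        if (now - windowStart >= refreshPeriodNanos) {
            windowStart = now;   // new window: reset the permit count
            usedInWindow = 0;
        }
        if (usedInWindow < limitForPeriod) {
            usedInWindow++;
            return true;
        }
        return false;            // over the limit for this window: fail fast
    }
}
```

With a limit of 100 per second, the 101st call inside a window is rejected immediately instead of being sent downstream and bounced with a 429.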
Bulkhead — isolating failures
A bulkhead limits the concurrent calls to a downstream service, preventing one slow downstream from consuming all available threads:
resilience4j:
bulkhead:
instances:
order-service:
max-concurrent-calls: 20 # max 20 concurrent calls
max-wait-duration: 100ms # wait up to 100ms if limit reached
@Bulkhead(name = "order-service", type = Bulkhead.Type.SEMAPHORE)
public Order getOrder(String orderId) {
return client.get()
.uri("/orders/{id}", orderId)
.retrieve()
.bodyToMono(Order.class)
.block(Duration.ofSeconds(5));
}
The semaphore bulkhead limits concurrent calls using a counting semaphore. THREAD_POOL bulkhead uses a dedicated thread pool per downstream service — stronger isolation but higher thread overhead.
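A semaphore bulkhead is essentially a thin wrapper around java.util.concurrent.Semaphore. A toy sketch of the idea (Resilience4j's version adds metrics, events, and its own BulkheadFullException):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Toy semaphore bulkhead: at most maxConcurrent calls run at once; callers
// wait up to maxWaitMillis for a permit, then fail fast.
class ToyBulkhead {
    private final Semaphore permits;
    private final long maxWaitMillis;

    ToyBulkhead(int maxConcurrent, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrent);
        this.maxWaitMillis = maxWaitMillis;
    }

    <T> T execute(Supplier<T> call) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted waiting for bulkhead permit", e);
        }
        if (!acquired) {
            throw new IllegalStateException("bulkhead full"); // BulkheadFullException analogue
        }
        try {
            return call.get();
        } finally {
            permits.release(); // always return the permit, even on failure
        }
    }
}
```

The key property: a slow downstream can pin at most maxConcurrent threads; the rest of the pool stays free to serve requests that don't touch that dependency.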
Combining patterns — the right order
Multiple Resilience4j annotations compose. The execution order matters:
@Retry(name = "order-service") // 1. outermost — retries the whole chain
@CircuitBreaker(name = "order-service") // 2. checks circuit state before each attempt
@TimeLimiter(name = "order-service") // 3. applies a timeout to the call
@Bulkhead(name = "order-service") // 4. innermost — acquires a concurrent-call permit
public CompletableFuture<Order> getOrder(String orderId) {
    return CompletableFuture.supplyAsync(() ->
        client.get()
            .uri("/orders/{id}", orderId)
            .retrieve()
            .bodyToMono(Order.class)
            .block()
    );
}
The default aspect order, which Resilience4j fixes regardless of how the annotations are written: Retry → CircuitBreaker → RateLimiter → TimeLimiter → Bulkhead → actual call. The retry wraps everything — if the circuit is open or the bulkhead is full, the retry sees a CallNotPermittedException or BulkheadFullException and may retry based on configuration. Usually you want to configure retry to NOT retry these — only retry on transient network errors:
resilience4j:
retry:
instances:
order-service:
ignore-exceptions:
- io.github.resilience4j.circuitbreaker.CallNotPermittedException
- io.github.resilience4j.bulkhead.BulkheadFullException
The timeout that actually matters
A timeout without a fallback just means slower failure. The meaningful timeout is the one that protects your thread pool from being exhausted:
public Order getOrder(String orderId) {
    return client.get()
        .uri("/orders/{id}", orderId)
        .retrieve()
        .bodyToMono(Order.class)
        .timeout(Duration.ofSeconds(3)) // WebClient-level timeout — emits TimeoutException into the pipeline
        .onErrorResume(TimeoutException.class, ex -> {
            log.warn("Timeout calling order-service for order {}", orderId);
            return Mono.justOrEmpty(orderCache.getIfPresent(orderId));
        })
        .onErrorResume(WebClientRequestException.class, ex -> {
            log.warn("Connection failure calling order-service for order {}", orderId);
            return Mono.justOrEmpty(orderCache.getIfPresent(orderId));
        })
        .defaultIfEmpty(Order.unavailable(orderId))
        .block(Duration.ofSeconds(4)); // block() timeout as safety net
}
Two timeout layers: .timeout() on the reactive pipeline emits a TimeoutException after 3 seconds, which onErrorResume converts into a cached or degraded response (a try/catch around block() cannot catch it directly, because Reactor surfaces the checked TimeoutException wrapped in a RuntimeException); .block() with a 4-second budget is a safety net that throws IllegalStateException rather than letting the calling thread block indefinitely. The fallback (cache or degraded response) determines whether the caller experiences the failure.
Health and observability for resilience patterns
Resilience4j exports metrics to Micrometer automatically:
resilience4j.circuitbreaker.calls{name="order-service", kind="successful"}
resilience4j.circuitbreaker.calls{name="order-service", kind="failed"}
resilience4j.circuitbreaker.not.permitted.calls{name="order-service"} # calls rejected by an open circuit
resilience4j.circuitbreaker.state{name="order-service", state="open"} # 1 while the breaker is in that state, else 0 (one gauge per state)
resilience4j.retry.calls{name="order-service", kind="successful_without_retry"}
resilience4j.retry.calls{name="order-service", kind="successful_with_retry"}
resilience4j.retry.calls{name="order-service", kind="failed_with_retry"}
Alert on:
An OPEN circuit persisting for more than 60 seconds — the downstream service is persistently degraded.
A rising kind="failed_with_retry" rate — retries are failing, indicating a persistent downstream issue.
A rising kind="not_permitted" rate — callers are hitting open circuits, indicating a degraded user experience.
The combination of circuit breakers, retries, bulkheads, and monitoring creates a system that degrades gracefully rather than cascading — downstream failures are contained, retried appropriately, and surfaced through metrics before users experience the full impact.