Retry Logic Sounds Simple Until It Makes Things Worse

by Arif Ikhsanudin, Backend Developer

How retries can make an outage worse

Payment Service goes down for 90 seconds. Order Service, which calls Payment Service synchronously, retries every failed request three times with a 1-second delay. Order Service is handling 50 requests per second. During the 90-second outage, about 4,500 requests fail, and with three retries each that adds up to approximately 13,500 retry attempts. When Payment Service recovers, it receives that retry traffic, roughly 150 requests per second, on top of the 50 new requests per second arriving normally. Payment Service, which just recovered from an overload condition, is now receiving 4x its normal traffic. It goes back down. The cycle continues.

This is the thundering herd problem, and it is caused by retry logic that doesn't account for the systemic effect of many callers retrying simultaneously.
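
The arithmetic behind this scenario can be checked with a few lines of Java. This is a rough sketch: the class and method names are illustrative, and it assumes the retry traffic is spread evenly over a window as long as the outage itself.

```java
// Back-of-the-envelope retry amplification; the inputs come from the scenario above.
final class RetryAmplification {
    /** Retry attempts generated by the requests that failed during an outage. */
    static long retryAttempts(int requestsPerSecond, int outageSeconds, int retriesPerRequest) {
        long failedRequests = (long) requestsPerSecond * outageSeconds;
        return failedRequests * retriesPerRequest;
    }

    /** Total load relative to normal when retries arrive alongside new traffic. */
    static double loadMultiplier(int normalRps, double retryRps) {
        return (normalRps + retryRps) / normalRps;
    }
}
```

With 50 requests per second, a 90-second outage, and three retries, `retryAttempts(50, 90, 3)` gives 13,500. Spread over 90 seconds that is 150 retries per second, and `loadMultiplier(50, 150.0)` gives 4.0, which is four times the normal load.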

The four components of correct retry behavior

Exponential backoff: each retry waits longer than the previous one. If the first retry waits 100ms, the second waits 200ms, the third waits 400ms. This reduces the retry rate over time and gives the downstream service space to recover.

Jitter: add randomness to the backoff interval. Without jitter, all instances of a caller retry at the same intervals — 100ms, 200ms, 400ms — producing synchronized retry bursts. With jitter (full jitter: random value between 0 and the backoff interval), retries spread across the recovery period:

// Exponential backoff with full jitter
long computeBackoff(int attempt, long baseMs, long maxMs) {
    long exponential = (long) (baseMs * Math.pow(2, attempt));
    long capped = Math.min(exponential, maxMs);
    return (long) (Math.random() * capped); // full jitter
}

AWS's SDKs apply exponential backoff with jitter by default. Most HTTP client libraries (OkHttp, Apache HttpClient) do not; you have to configure or implement it explicitly.
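
To see what the cap does to the schedule, it can help to separate the backoff ceiling from the jitter. The sketch below is an illustration, not a library API: `BackoffSchedule`, `ceilingMs`, and `fullJitterMs` are hypothetical names, and it assumes a base delay in the millisecond-to-second range.

```java
import java.util.Random;

final class BackoffSchedule {
    /** Upper bound for the wait before retry number `attempt` (0-based), capped at maxMs. */
    static long ceilingMs(int attempt, long baseMs, long maxMs) {
        // Clamp the shift; with a base in the millisecond-to-second range
        // the product stays well within the range of a long.
        int shift = Math.min(attempt, 30);
        long exponential = baseMs * (1L << shift);
        return Math.min(exponential, maxMs);
    }

    /** Full jitter: a uniformly random wait in [0, ceilingMs). */
    static long fullJitterMs(long ceilingMs, Random random) {
        return (long) (random.nextDouble() * ceilingMs);
    }
}
```

With a 100ms base and a 5000ms cap, the ceilings for attempts 0 through 6 are 100, 200, 400, 800, 1600, 3200, and 5000 milliseconds. Each caller then sleeps for a random value below its ceiling, so callers that failed at the same moment no longer retry at the same moment.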

Retry budget: limit total retry attempts. Three retries is a common maximum. Beyond that, the request is likely failing for a reason that waiting won't fix, and you're just consuming resources. Some teams implement retry budgets at the service level rather than per-request — if more than 10% of requests in a window are retries, stop retrying entirely and let the circuit breaker take over.
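
A service-level budget along these lines might look like the sketch below. It is a minimal illustration under stated assumptions: `RetryBudget` and its methods are hypothetical names, the ratio is measured as retries against first attempts, and the window is reset by the caller (for example from a scheduled task) rather than sliding continuously.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical service-level retry budget: retries may be at most a fixed share of requests. */
final class RetryBudget {
    private final double maxRetryRatio;
    private final AtomicLong requests = new AtomicLong();
    private final AtomicLong retries = new AtomicLong();

    RetryBudget(double maxRetryRatio) {
        this.maxRetryRatio = maxRetryRatio;
    }

    /** Call once for every first attempt. */
    void recordRequest() {
        requests.incrementAndGet();
    }

    /** Returns true and counts the retry if it fits the budget; false means do not retry. */
    boolean tryAcquireRetry() {
        long allowed = (long) (requests.get() * maxRetryRatio);
        while (true) {
            long used = retries.get();
            if (used >= allowed) return false;
            // Compare-and-set so concurrent callers cannot overspend the budget.
            if (retries.compareAndSet(used, used + 1)) return true;
        }
    }

    /** Call at the start of each window to reset both counters. */
    void resetWindow() {
        requests.set(0);
        retries.set(0);
    }
}
```

A caller would invoke `recordRequest()` before each first attempt and `tryAcquireRetry()` before each retry, failing fast when it returns false. With a ratio of 0.10 and 100 recorded requests, ten retries are granted and the eleventh is refused.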

Idempotency: retries are only safe if the operation being retried is idempotent — repeating it has the same effect as doing it once. GET requests are inherently idempotent. POST requests that create resources are not — without idempotency protection, a network timeout after a successful payment charge, followed by a retry, charges the user twice.

The correct pattern for idempotent mutations uses an idempotency key sent with the request:

POST /payments
Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
Content-Type: application/json

{
  "orderId": "order-123",
  "amount": 99.99,
  "currency": "USD"
}

The server stores processed requests by idempotency key. If the same key arrives again (from a retry), it returns the stored result without re-executing the payment. The client generates the key before the first attempt and reuses it on every retry:

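// Generated once per logical operation and reused on every attempt
String idempotencyKey = UUID.randomUUID().toString();
// MAX_RETRIES counts total attempts, including the first call
for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
    try {
        return paymentClient.charge(request, idempotencyKey);
    } catch (TransientException e) {
        if (attempt == MAX_RETRIES - 1) throw e; // out of attempts, surface the failure
        // Thread.sleep throws InterruptedException; the enclosing method must declare or handle it
        Thread.sleep(computeBackoff(attempt, 100, 5000));
    }
}
// Only reached when MAX_RETRIES is zero or negative
throw new IllegalStateException("MAX_RETRIES must be at least 1");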
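
On the server side, the "store processed requests by idempotency key" step can be sketched as follows. This is an in-memory illustration only: `IdempotencyStore` and `executeOnce` are hypothetical names, and a real payment service would need a durable store shared across instances, with expiry for old keys.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

/** Hypothetical in-memory idempotency store, keyed by the Idempotency-Key header value. */
final class IdempotencyStore<R> {
    private final ConcurrentMap<String, R> results = new ConcurrentHashMap<>();

    /**
     * Runs the operation the first time a key is seen and returns the stored
     * result for every later call with the same key.
     */
    R executeOnce(String idempotencyKey, Supplier<R> operation) {
        // computeIfAbsent applies the function at most once per key,
        // even when the original request and a retry arrive concurrently.
        return results.computeIfAbsent(idempotencyKey, key -> operation.get());
    }
}
```

If the operation throws, nothing is stored, so a later retry with the same key executes it again. That is the desired behavior for a charge that never went through, but it assumes the operation fails before it has any side effect.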

What not to retry

Not every failure should be retried. Retrying a 400 Bad Request (client error) is pointless: the request is malformed and retrying with the same payload produces the same error. Retrying a 409 Conflict (business rule violation: item out of stock) wastes resources. Only transient errors benefit from retry: 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout, 429 Too Many Requests, network timeouts, and connection resets.

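import feign.FeignException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;

public boolean isRetryable(Exception ex) {
    // Transient HTTP statuses reported by the downstream service or a gateway
    if (ex instanceof FeignException feignEx) {
        int status = feignEx.status();
        if (status == 503 || status == 429 || status == 502 || status == 504) {
            return true;
        }
    }
    // Network failures can arrive directly or wrapped, so check the cause as well
    return isNetworkFailure(ex) || isNetworkFailure(ex.getCause());
}

private boolean isNetworkFailure(Throwable t) {
    return t instanceof SocketTimeoutException || t instanceof ConnectException;
}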

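For 429 Too Many Requests specifically, the server may include a Retry-After header indicating how long to wait, either as a number of seconds or as an HTTP date. Honor it:

if (ex instanceof FeignException feignEx && feignEx.status() == 429) {
    // responseHeaders() yields a Collection per header, so take the first value via its iterator
    String retryAfter = feignEx.responseHeaders()
        .getOrDefault("Retry-After", List.of("1")).iterator().next();
    // Assumes the delay-seconds form; an HTTP-date value would make parseLong throw
    Thread.sleep(Long.parseLong(retryAfter.trim()) * 1000);
}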
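
Because Retry-After can carry either delay-seconds or an HTTP date, a more defensive parser handles both forms and falls back to a default when the value is unusable. The sketch below uses only the JDK; `RetryAfterParser` and its `parse` method are hypothetical names, and the fallback value is the caller's choice.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

final class RetryAfterParser {
    /** Converts a Retry-After value (delay-seconds or HTTP-date) to a wait, or returns fallback. */
    static Duration parse(String headerValue, Instant now, Duration fallback) {
        if (headerValue == null || headerValue.isBlank()) return fallback;
        String value = headerValue.trim();
        try {
            long seconds = Long.parseLong(value);
            return seconds >= 0 ? Duration.ofSeconds(seconds) : fallback;
        } catch (NumberFormatException notSeconds) {
            // Not an integer, so try the HTTP-date form next.
        }
        try {
            Instant retryAt = ZonedDateTime
                .parse(value, DateTimeFormatter.RFC_1123_DATE_TIME)
                .toInstant();
            Duration wait = Duration.between(now, retryAt);
            // A date in the past means the caller may retry immediately.
            return wait.isNegative() ? Duration.ZERO : wait;
        } catch (DateTimeParseException notADate) {
            return fallback;
        }
    }
}
```

Passing `now` as a parameter keeps the method deterministic in tests. In production code the caller would typically pass `Instant.now()` and cap the returned wait so that a misbehaving server cannot stall the client for hours.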

Retry and circuit breakers together

Retries and circuit breakers work best together. The circuit breaker prevents retrying against a known-down service (stops calls immediately rather than retrying into a black hole). The retry logic handles transient blips before the circuit breaker threshold is hit. Configure them with the right relationship:

  • Retry: 3 attempts with exponential backoff + jitter
  • Circuit breaker: opens after 50% failure rate in a 20-call sliding window

With these settings, a genuine service outage triggers the circuit breaker (preventing retries from amplifying load) while brief transient failures (connection resets, single slow responses) are handled by retries before the circuit threshold is reached.
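
The interaction described above can be illustrated with a deliberately small count-based breaker. This is a teaching sketch, not a substitute for a production library: `SlidingWindowBreaker` is a hypothetical class, it has no half-open state or recovery timer, and it only evaluates the failure rate once the window is full.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical breaker: opens when the failure rate over the last N calls reaches a threshold. */
final class SlidingWindowBreaker {
    private final int windowSize;
    private final double failureRateThreshold;
    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failed call
    private int failures = 0;
    private boolean open = false;

    SlidingWindowBreaker(int windowSize, double failureRateThreshold) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
    }

    /** The retry loop should check this before every attempt, including retries. */
    synchronized boolean allowsCalls() {
        return !open;
    }

    /** Records the outcome of one attempt; retries count as separate calls. */
    synchronized void record(boolean failed) {
        outcomes.addLast(failed);
        if (failed) failures++;
        // Drop the oldest outcome once the window is over capacity.
        if (outcomes.size() > windowSize && outcomes.removeFirst()) failures--;
        // Evaluate only on a full window so a few early failures cannot open the breaker.
        if (outcomes.size() == windowSize
                && (double) failures / windowSize >= failureRateThreshold) {
            open = true;
        }
    }
}
```

With a window of 20 and a threshold of 0.5, three failed attempts among otherwise successful calls leave the breaker closed, so retries handle the blip. Ten failures in the last 20 calls open it, and a retry loop that checks `allowsCalls()` stops sending traffic.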

Test this combination explicitly in staging by injecting failures at different rates and durations. Verify that transient failures (< 30 seconds) are handled by retries without triggering the circuit breaker, and genuine outages (> 60 seconds) open the circuit breaker and stop retry amplification.

