Synchronous Communication in Microservices Is a Trap

by Eric Hanson, Backend Developer at Clean Systems Consulting

How you end up with a system worse than the one you started with

You split the monolith. Each service has its own database, its own deployment pipeline, its own team. On paper it looks like you've achieved the independence microservices promise. In production, when the Inventory Service has a 30-second GC pause, your Order Service starts timing out, your API Gateway starts returning 503s, and your users see checkout failures.

This is not an Inventory Service problem. It's an architecture problem. You've built a system where the availability of a single service determines the availability of every upstream service that depends on it — the same failure coupling you had in the monolith, now expressed through network calls instead of function calls. The monolith at least failed fast. Now your failure cascades slowly through retry queues and thread pool exhaustion.

Why synchronous chains are so dangerous

The math is straightforward. If each service has 99.9% availability (the three-nines SLA many teams consider acceptable), a chain of three synchronous dependencies gives you 99.9% × 99.9% × 99.9% ≈ 99.7% combined availability. That's roughly 26 hours of downtime per year from a chain of individually reliable services. A chain of ten services — not unusual in complex microservices architectures — drops to 99.0%.
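The same arithmetic works for any chain depth. A few lines make it concrete (plain Java; the class and method names here are illustrative, not from any library):

```java
// Combined availability of a synchronous call chain: each hop multiplies in.
public class ChainAvailability {
    public static final double HOURS_PER_YEAR = 24 * 365; // 8760

    // Availability of a chain of n services, each with the same per-service availability.
    public static double chain(double perService, int n) {
        return Math.pow(perService, n);
    }

    // Expected downtime in hours per year for a given combined availability.
    public static double downtimeHours(double availability) {
        return (1.0 - availability) * HOURS_PER_YEAR;
    }

    public static void main(String[] args) {
        System.out.printf("3-hop chain:  %.4f%%, ~%.0f h/yr downtime%n",
                chain(0.999, 3) * 100, downtimeHours(chain(0.999, 3)));
        System.out.printf("10-hop chain: %.4f%%, ~%.0f h/yr downtime%n",
                chain(0.999, 10) * 100, downtimeHours(chain(0.999, 10)));
    }
}
```

Note the asymmetry: each hop looks individually harmless, but the downtime budget of the whole chain is spent by whichever hop happens to be failing at the moment.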

More practically: synchronous coupling means that any slow service in the chain becomes a slow service for every caller. If Inventory Service has a database query that degrades from 20ms to 2,000ms under high load, Order Service's request handling blocks those threads. If Order Service uses a fixed-size thread pool (common with Spring's default Tomcat connector or traditional servlet containers), those threads fill up, requests queue, and Order Service itself becomes unavailable — not because Order Service has a bug, but because Inventory Service is slow.

This is cascading failure. It's a distributed systems property, not a bug you can fix in any single service.

The latency addition problem

Even without failures, synchronous service chains add latency hop by hop. A request that touches five sequential services with average response times of 50ms each takes at minimum 250ms of pure network and processing time before your service adds any of its own. Independent calls can be issued in parallel, but in a genuine chain each call depends on the result of the previous one and cannot be.

At 250ms, you're already above many UX guidelines for perceived responsiveness in interactive applications. For mobile clients with higher round-trip latency, it compounds further.

The specific failure modes synchronous calls introduce

Thread pool saturation: Blocking threads waiting for downstream responses consume resources. Under slow downstream conditions, thread pools saturate faster than load increases. This is why even modest traffic spikes during a downstream degradation can take an upstream service fully offline.
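One standard defence against saturation is a bulkhead: cap the number of threads allowed to block on any one downstream, so a slow dependency can consume at most that many threads and never the whole request-handling pool. A minimal sketch using a plain java.util.concurrent.Semaphore (the class and its API are mine, not from any resilience library):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Bulkhead sketch: cap how many threads may block on one downstream at a time,
// so a slow dependency can't absorb the whole request-handling pool.
public class Bulkhead {
    private final Semaphore permits;
    private final long acquireTimeoutMs;

    public Bulkhead(int maxConcurrent, long acquireTimeoutMs) {
        this.permits = new Semaphore(maxConcurrent);
        this.acquireTimeoutMs = acquireTimeoutMs;
    }

    // Run the call if a permit is free; otherwise fail fast with the fallback
    // instead of queueing behind an already-degraded downstream.
    public <T> T execute(Supplier<T> call, Supplier<T> fallback) throws InterruptedException {
        if (!permits.tryAcquire(acquireTimeoutMs, TimeUnit.MILLISECONDS)) {
            return fallback.get(); // saturated: degrade instead of blocking
        }
        try {
            return call.get();
        } finally {
            permits.release();
        }
    }
}
```

The important property is that rejection is immediate: when the bulkhead is full, callers get a fast fallback rather than a slowly-growing queue of blocked threads.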

Retry amplification: If Service A retries failed calls to Service B three times, a brief B outage generates up to four times the normal load on B (the original attempt plus three retries) when it recovers. If multiple services retry simultaneously (the thundering herd problem), the recovering service gets hammered with amplified retry traffic before it can stabilize.
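The standard mitigation is capped exponential backoff with jitter, so retries spread out in time instead of arriving as a synchronized wave. A sketch of just the delay calculation (plain Java; names are mine):

```java
import java.util.Random;

// Capped exponential backoff with full jitter: each retry waits a random
// duration below an exponentially growing (but capped) ceiling, so retries
// from many callers decorrelate instead of arriving as a herd.
public class Backoff {
    // Upper bound for this attempt: base * 2^attempt, capped.
    public static long expCeiling(long baseMs, long capMs, int attempt) {
        long exp = baseMs << Math.min(attempt, 30); // clamp shift to avoid overflow
        return Math.min(capMs, exp);
    }

    // Full jitter: a uniformly random delay below the ceiling.
    public static long delayWithJitter(long baseMs, long capMs, int attempt, Random rng) {
        long ceiling = expCeiling(baseMs, capMs, attempt);
        return (long) (rng.nextDouble() * ceiling);
    }
}
```

With a 100ms base and a 10s cap, the ceilings run 100ms, 200ms, 400ms, 800ms, and so on; the jitter is what breaks up the thundering herd, since without it every caller that failed at the same instant retries at the same instant.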

Timeout misconfiguration: If A calls B with a 5-second timeout, and B calls C with a 5-second timeout, B can legitimately spend almost all of A's budget waiting on C and still time out, leaving A with nothing but a late failure. Add a single retry from B to C and B needs up to 10 seconds, longer than A will ever wait. Inner timeouts must be strictly shorter than outer ones, but timeout values rarely account for the full call chain depth.
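One way out is deadline propagation: the edge sets a single deadline for the whole request, and each hop hands its remaining budget downstream (for example in a request header; gRPC does this natively with call deadlines) instead of using a fixed per-hop timeout. A minimal sketch, with names that are mine:

```java
// Deadline propagation sketch: each hop computes its remaining budget from one
// request-wide deadline and passes it on, so inner calls can never be granted
// more time than the caller is still willing to wait.
public class Deadline {
    private final long deadlineNanos;

    private Deadline(long deadlineNanos) {
        this.deadlineNanos = deadlineNanos;
    }

    // Set once at the edge of the system, e.g. Deadline.after(5000).
    public static Deadline after(long timeoutMs) {
        return new Deadline(System.nanoTime() + timeoutMs * 1_000_000L);
    }

    // Budget to hand to the next hop, minus a reserve for our own processing.
    public long remainingMs(long reserveMs) {
        long leftMs = (deadlineNanos - System.nanoTime()) / 1_000_000L;
        return Math.max(0, leftMs - reserveMs);
    }

    // Cheap pre-check: don't even start work the caller has given up on.
    public boolean expired() {
        return System.nanoTime() >= deadlineNanos;
    }
}
```

A hop that sees remainingMs() hit zero can fail immediately rather than doing work whose result nobody upstream is still waiting for.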

Moving interactions to async where possible

The fundamental fix is to identify which synchronous interactions are not actually synchronous by necessity and convert them to event-driven patterns.

An Order Service that synchronously calls a Notification Service to send a confirmation email has no business doing so. The user doesn't wait for the email before their checkout completes. The interaction should be:

// Before: synchronous, adds latency and couples availability
notificationService.sendOrderConfirmation(order); // HTTP call

// After: publish event, notification service handles asynchronously
eventPublisher.publish(new OrderConfirmedEvent(order.getId(), order.getUserId()));
// Notification service consumes from Kafka and sends email independently

The Notification Service can be down for six hours. Orders are not affected. When it recovers, it processes the backlog of events. Nothing is lost. No retries needed at the Order Service level.

Designing for partial availability

For interactions that are genuinely synchronous — where the response is needed before proceeding — you need to design each call for the assumption that it will sometimes be slow or unavailable.

Circuit breakers (Resilience4j in Java, gobreaker in Go, Polly in .NET) wrap synchronous clients and stop calls to unhealthy downstreams before thread pools saturate:

@CircuitBreaker(name = "inventoryService", fallbackMethod = "getInventoryFallback")
public InventoryStatus getInventory(String itemId) {
    return inventoryClient.getStatus(itemId);
}

private InventoryStatus getInventoryFallback(String itemId, Exception e) {
    // Return cached data, or degrade gracefully
    return InventoryStatus.assumeAvailable(itemId);
}

The fallback is where you earn your keep architecturally. "Return cached data" is fine for catalog information that changes slowly. "Assume available" is a business risk decision — you might accept orders for out-of-stock items and deal with fulfillment failures downstream. Know what you're trading.
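If the annotation hides too much, the underlying mechanism is a small state machine. A deliberately simplified sketch of what circuit breaker libraries implement (illustrative plain Java, not Resilience4j internals): trip OPEN after consecutive failures, fail fast while open, allow one trial call after a cool-down:

```java
import java.util.function.Supplier;

// Minimal circuit breaker state machine, heavily simplified:
// CLOSED    -> normal operation, counting consecutive failures
// OPEN      -> fail fast to the fallback until a cool-down elapses
// HALF_OPEN -> allow one trial call; success closes, failure re-opens
public class SimpleCircuitBreaker {
    public enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    public SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    public synchronized <T> T call(Supplier<T> downstream, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt < openMillis) {
                return fallback.get();          // still cooling down: fail fast
            }
            state = State.HALF_OPEN;            // cool-down over: allow one trial
        }
        try {
            T result = downstream.get();
            state = State.CLOSED;               // trial (or normal call) succeeded
            consecutiveFailures = 0;
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold || state == State.HALF_OPEN) {
                state = State.OPEN;             // trip: stop hammering the downstream
                openedAt = System.currentTimeMillis();
            }
            return fallback.get();
        }
    }

    public State state() {
        return state;
    }
}
```

The fail-fast path is the whole point: while the breaker is open, callers get their fallback in microseconds instead of holding a thread for a full timeout, which is exactly what keeps the thread pool alive.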

Synchronous calls you can't eliminate should be short, independently resilient, and have a fallback that degrades gracefully rather than failing hard. For the rest — the notifications, the audit logs, the analytics, the downstream fulfillment triggers — publish events and stop waiting for responses you don't need.
