Why Your Services Can't Stop Talking to Each Other
by Eric Hanson, Backend Developer at Clean Systems Consulting
What chatty services are telling you
Your order service calls the user service for profile data, the credit service for limit checks, the inventory service for availability, and the shipping service for rate calculations, all within a single request. You've added aggressive caching, tightened timeout windows, and deployed an Envoy-based service mesh, and latency is still unacceptable. The problem is not the network. The problem is that you've drawn your service boundaries in the wrong places and are now compensating with infrastructure.
Chatty services (services that can't serve a request without making multiple synchronous calls to other services) are a consistent indicator of one or more of these underlying issues: bounded contexts that were cut along technical layers rather than business domains, data that belongs in one service but lives in another, or orchestration logic that should be event-driven but is synchronous by design.
The layered architecture trap
The most common cause of chatty services is drawing service boundaries along technical layers rather than business capabilities. Teams coming from a layered monolith (presentation, business logic, data access) replicate that structure as services: a "data service," a "business logic service," an "API gateway service." This is backwards.
A "data service" that just wraps database access for other services is not a microservice. It's a remote repository layer. Every business operation requires calling it, which means every service is permanently coupled to it. Adding caching doesn't fix this — it just trades staleness risk for latency improvement while keeping the fundamental coupling intact.
Services should own their data and expose business capabilities, not raw data access:
❌ Layered (creates chatty services)
Request → Order Logic Service
    → GET /data/users/{id}       (User Data Service)
    → GET /data/inventory/{id}   (Inventory Data Service)
    → GET /data/prices/{id}      (Price Data Service)
    → do logic locally
    → POST /data/orders          (Order Data Service)
✅ Domain-oriented (services own their data)
Request → Order Service
    → GET /users/{id}/order-context   (User Service — returns only what ordering needs)
    → POST /orders/initiate           (Order Service does its own writes)
    Async: publishes OrderInitiated event
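What "returns only what ordering needs" looks like in practice is a narrow, purpose-built contract rather than a generic user dump. Here is a minimal Spring-style sketch; the DTO, repository, and field names are illustrative assumptions, not a prescribed design:

// Hypothetical sketch of the User Service's order-context endpoint.
// UserRepository, User, and Address are assumed types.
@RestController
class UserOrderContextController {

    private final UserRepository users;

    UserOrderContextController(UserRepository users) {
        this.users = users;
    }

    @GetMapping("/users/{id}/order-context")
    UserOrderContext orderContext(@PathVariable UUID id) {
        User u = users.getById(id);
        // Expose only what ordering needs; nothing else leaks across the boundary.
        return new UserOrderContext(u.id(), u.isActive(), u.defaultShippingAddress());
    }
}

record UserOrderContext(UUID userId, boolean active, Address shippingAddress) {}

The narrow response type is the point: if the Order Service only ever sees three fields, the User Service can change everything else without breaking anyone.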
Replicating domain data to reduce coupling
When a service legitimately needs data from another domain for its own operations, the answer is often not a synchronous call — it's a local copy of the relevant data, kept current via events.
An Order Service that needs to check whether a user is in good standing (active account, no fraud flags) does not need a synchronous call to the User Service on every order request. The User Service can publish UserStatusChanged events to a Kafka topic, and the Order Service can consume them into a local eligibility table:
-- Local read model, populated from UserStatusChanged events
CREATE TABLE user_order_eligibility (
    user_id      UUID PRIMARY KEY,
    is_eligible  BOOLEAN NOT NULL DEFAULT TRUE,
    reason       VARCHAR(255),
    updated_at   TIMESTAMP NOT NULL
);
Now the Order Service checks eligibility locally with a single DB read. No network call. No dependency on User Service uptime. The data is eventually consistent: if a user is flagged for fraud, there's a short window where they could still place orders. For most systems, that window (milliseconds to seconds, depending on event processing lag) is acceptable. If it's not acceptable, you have a genuine synchronous query requirement, and you should model it that way explicitly.
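On the consuming side, here's a minimal sketch of the event handler, assuming a topic named user-status-changed, a JSON UserStatusChanged payload, Spring's JdbcTemplate (jdbc), and a Postgres-style upsert; all of those are assumptions, not requirements:

// Order Service: keep the local eligibility table current from events.
Properties props = new Properties();
props.put("bootstrap.servers", "kafka:9092");
props.put("group.id", "order-service-eligibility");
props.put("key.deserializer", StringDeserializer.class.getName());
props.put("value.deserializer", StringDeserializer.class.getName());

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(List.of("user-status-changed"));
    while (true) {
        for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
            UserStatusChanged evt = parseEvent(rec.value()); // assumed JSON helper
            // Upsert: the latest status event per user wins.
            jdbc.update("""
                INSERT INTO user_order_eligibility (user_id, is_eligible, reason, updated_at)
                VALUES (?, ?, ?, now())
                ON CONFLICT (user_id) DO UPDATE SET
                    is_eligible = EXCLUDED.is_eligible,
                    reason = EXCLUDED.reason,
                    updated_at = EXCLUDED.updated_at
                """, evt.userId(), evt.isEligible(), evt.reason());
        }
    }
}

The order-time check is then one local read: SELECT is_eligible FROM user_order_eligibility WHERE user_id = ?.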
Orchestration versus choreography
Another source of chatty services is orchestration-heavy design: one service calling a sequence of other services to drive a workflow. The Order Service calls the Inventory Service to reserve stock, the Payment Service to charge the card, and the Fulfillment Service to schedule delivery. Every step is a synchronous dependency, and every failure cascades.
Choreography — event-driven coordination — reduces this coupling. Each service reacts to events from the previous step without being called:
Order Service publishes: OrderConfirmed
    → Inventory Service consumes: reserves stock, publishes: StockReserved
    → Payment Service consumes: charges card, publishes: PaymentCollected
    → Fulfillment Service consumes: schedules delivery, publishes: ShipmentScheduled
No service calls another directly. The workflow emerges from event subscriptions. Adding a new step (a fraud check between order confirmation and inventory reservation) means adding a consumer and re-pointing the step after it at the new event; the Order Service doesn't change. Removing a step means removing a consumer. The coupling is to the event schema, not to other services' APIs.
The downside: workflow state is distributed. Debugging a failed workflow requires correlating events across multiple topics and services. You need distributed tracing and event correlation IDs from the start, not as an afterthought.
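Here's what one participant looks like with correlation built in from the start: the Inventory Service reacting to OrderConfirmed and carrying the correlation ID forward. A sketch with plain Kafka clients; the topic names, event shapes, parser helpers, and header key are assumptions:

// Inventory Service: react to OrderConfirmed, write locally, publish
// StockReserved, and propagate the correlation id for tracing.
consumer.subscribe(List.of("order-confirmed"));
while (true) {
    for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
        OrderConfirmed evt = parseOrderConfirmed(rec.value()); // assumed parser
        reserveStock(evt.orderId(), evt.lines());              // Inventory's own DB, no remote call

        ProducerRecord<String, String> out = new ProducerRecord<>(
            "stock-reserved", evt.orderId(), toJson(new StockReserved(evt.orderId())));
        Header corr = rec.headers().lastHeader("correlation-id");
        if (corr != null) out.headers().add(corr); // same id across the whole workflow
        producer.send(out);
    }
}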
When synchronous calls are unavoidable
Some inter-service calls are genuinely synchronous requirements: real-time credit decisions, inventory availability at checkout, pricing at point of sale. These should be the exception, not the default, and they should be designed with the assumption that the downstream service will sometimes be slow or unavailable.
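One way to build that assumption into the call site is a time budget plus a degraded fallback rather than an unbounded wait. A sketch with plain CompletableFuture; creditClient, the 300 ms budget, and the manual-review fallback are assumptions:

// Guard an unavoidable synchronous dependency with a budget and a
// degraded-mode response instead of letting the failure cascade.
CompletableFuture<CreditDecision> decision =
    CompletableFuture.supplyAsync(() -> creditClient.check(userId, amount))
        .orTimeout(300, TimeUnit.MILLISECONDS)               // Java 9+
        .exceptionally(ex -> CreditDecision.manualReview()); // assumed fallback path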
If after restructuring your domain model you still have five synchronous calls per request, look at whether those calls can be parallelized. If they're independent, fan them out concurrently:
// Fan out independent calls concurrently. Note: supplyAsync without an
// explicit executor runs on the shared ForkJoinPool; pass a dedicated
// executor for blocking I/O so slow calls can't starve other work.
CompletableFuture<UserContext> userFuture =
    CompletableFuture.supplyAsync(() -> userClient.getOrderContext(userId));
CompletableFuture<InventoryStatus> inventoryFuture =
    CompletableFuture.supplyAsync(() -> inventoryClient.getStatus(itemIds));
CompletableFuture.allOf(userFuture, inventoryFuture).join();
// total latency = max(user latency, inventory latency), not the sum
But if you find yourself doing this routinely, it's still a signal that the domain model is wrong — you're compensating for a boundary problem with concurrency tricks.
The right question when services won't stop talking: which of these calls could be eliminated by moving data ownership to the service that needs it? Answer that first. Then optimize the calls that remain.