Microservices Sound Great on Paper. Here Is the Part Nobody Talks About.
by Eric Hanson, Backend Developer at Clean Systems Consulting
The pitch versus the reality
Your team just decided to go microservices. The architecture diagrams look clean, ownership boundaries are drawn, and the roadmap shows three services shipping independently by Q3. Six months later, you're debugging a checkout failure that spans four services, three databases, and a Kafka topic, and nobody can tell you exactly where it broke.
This is not a failure of execution. It is the expected outcome of adopting microservices without confronting what they actually cost.
The benefits are real: independent deployability, technology flexibility, isolated failure domains, and team autonomy at scale. Netflix, Uber, and Amazon use microservices, and it works for them. What the case studies leave out is that those organizations built significant platform infrastructure — internal PaaS layers, service meshes, distributed tracing systems, and dedicated reliability engineering teams — before microservices became manageable. They did not adopt microservices and then figure out the platform. They built the platform first.
Distributed systems problems that your monolith didn't have
When you split a monolith into services, every in-process function call becomes a network call. That sounds trivial until you internalize what it means:
- Partial failures are now normal. Service A calls Service B, which is slow but not down. A's thread pool fills up waiting. A starts failing. Its callers start failing. You now have a cascading failure that a single slow database query would never have caused.
- Transactions don't cross service boundaries. If your order service writes to its DB and then calls the inventory service, and the inventory call fails, you now have an order with no inventory reservation. Two-phase commit is theoretically possible but operationally nightmarish. You end up implementing sagas — compensating transactions that require careful state machine design.
- Data consistency is eventual, not guaranteed. If you publish an event to Kafka after a DB write and the process dies between those two operations, your event is lost. If you write the event first and the DB write fails, you have a phantom event. Getting this right requires the outbox pattern (write to a local outbox table in the same transaction, then relay to Kafka), which adds complexity most teams don't plan for.
-- Outbox pattern: same transaction as your domain write
BEGIN;
INSERT INTO orders (id, user_id, total) VALUES (?, ?, ?);
INSERT INTO outbox (event_type, payload, created_at)
VALUES ('order.created', '{"orderId": ...}', NOW());
COMMIT;
-- A relay process polls outbox and publishes to Kafka
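The relay itself can be dumb, and that is the point: it only has to move rows from the outbox to Kafka and tolerate being restarted at any moment. A minimal sketch, assuming the outbox table also carries an id primary key and a published flag, and the payload is stored as a JSON string; psycopg2 and kafka-python are illustrative client choices, not requirements:

# Outbox relay sketch: polls unpublished rows and forwards them to Kafka.
import time

import psycopg2
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="kafka:9092")
conn = psycopg2.connect("dbname=orders")

def relay_once():
    with conn.cursor() as cur:
        # Grab a small batch of unpublished events, oldest first. SKIP LOCKED
        # lets multiple relay instances run without stepping on each other.
        cur.execute(
            "SELECT id, event_type, payload FROM outbox "
            "WHERE published = false ORDER BY id LIMIT 100 FOR UPDATE SKIP LOCKED"
        )
        for event_id, event_type, payload in cur.fetchall():
            # Publish, then mark as sent. If the process dies between these
            # two steps, the event is re-sent on the next pass.
            producer.send(event_type, value=payload.encode("utf-8"))
            cur.execute("UPDATE outbox SET published = true WHERE id = %s", (event_id,))
        producer.flush()
    conn.commit()

while True:
    relay_once()
    time.sleep(0.5)

Note the guarantee you end up with: at-least-once, not exactly-once. The relay may re-send after a crash, so consumers have to tolerate duplicates.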
The operational surface area you're signing up for
A monolith in production means one deployment artifact, one log stream, one set of metrics, and one place to look when something breaks. Microservices multiply each of those by the number of services. With ten services:
- Ten CI/CD pipelines to maintain
- Ten sets of Kubernetes resource limits to tune
- Ten log streams to correlate manually unless you have centralized logging with trace IDs propagated through every service
- Ten health check endpoints, none of which tell you about cross-service dependency health
The hidden cost here is not engineering time — it's cognitive load. When an alert fires at 2 AM, the first question is always: which service? Your on-call engineer now needs enough context about every service in the system to triage effectively, or you need runbooks specific enough to be useful, or you need distributed tracing (Jaeger, Zipkin, or OpenTelemetry with a backend) that lets them follow a request across service boundaries.
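The mechanics of correlation are not exotic; OpenTelemetry automates exactly this, but the core move fits in a dozen lines: read an ID off the incoming request, attach it to every log line, forward it on every outgoing call. A framework-free sketch, where the header name and the downstream helper are illustrative rather than from any real codebase:

# Minimal correlation-ID propagation.
import logging
import uuid

log = logging.getLogger("checkout")

def call_inventory_service(headers: dict) -> None:
    # Stand-in for the real HTTP call to the inventory service.
    pass

def handle_request(incoming_headers: dict) -> None:
    # Reuse the caller's ID so the whole chain shares one; if this service
    # is the entry point, mint a new one.
    trace_id = incoming_headers.get("X-Request-ID", str(uuid.uuid4()))

    # Every log line carries the ID so centralized logging can stitch this
    # hop together with every other service the request touched.
    log.info("reserving inventory", extra={"trace_id": trace_id})

    # Every outbound call forwards it.
    call_inventory_service(headers={"X-Request-ID": trace_id})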
Most teams get the services. They get the tracing only after the third or fourth production incident where they couldn't find the root cause.
What nobody tells you about inter-service contracts
In a monolith, changing a function signature is a compiler error. In microservices, changing an API response shape is a production incident waiting to happen. If Service A starts omitting a field that Service B expects, B fails — silently, or noisily, depending on your error handling.
You need API versioning from day one. Not "we'll add it later." Consumer-driven contract testing (Pact is the standard tool here) lets consumer services define what they expect from providers, and breaks the provider's CI if those expectations aren't met. This is non-negotiable discipline in a mature microservices organization. It is rarely in place when teams first migrate.
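On the consumer side, a Pact test looks roughly like this with pact-python; the service names, endpoint, and payload shape here are illustrative:

# Consumer-driven contract test sketch using pact-python.
import atexit
import unittest

import requests
from pact import Consumer, Provider

pact = Consumer("checkout-web").has_pact_with(Provider("order-service"))
pact.start_service()
atexit.register(pact.stop_service)

class OrderContract(unittest.TestCase):
    def test_get_order_includes_total(self):
        expected = {"id": "o-123", "total": 4999, "currency": "USD"}
        (pact
         .given("order o-123 exists")
         .upon_receiving("a request for order o-123")
         .with_request("GET", "/orders/o-123")
         .will_respond_with(200, body=expected))

        with pact:
            # The consumer codifies exactly the fields it depends on.
            resp = requests.get(pact.uri + "/orders/o-123")

        self.assertEqual(resp.json()["total"], 4999)

The pact file this test produces is then replayed against the real provider in its CI, so a provider that drops the total field fails its own build instead of breaking checkout in production.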
The honest tradeoff
Microservices are the right architecture for organizations with enough teams that a monolith creates deployment bottlenecks — where ten teams waiting for one release train is genuinely worse than the distributed systems complexity you're taking on. For a team of five to fifteen engineers, that tradeoff almost never makes sense.
The work that makes microservices survivable — centralized logging with correlation IDs, distributed tracing, the outbox pattern for reliable event publishing, consumer-driven contract tests, circuit breakers — is the work that gets skipped when teams move fast and commit to the architecture before committing to the platform.
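Of that list, circuit breakers are the cheapest to sketch and among the most commonly skipped. The idea fits in a few dozen lines; this is a toy illustration of the pattern, not a substitute for a battle-tested library or a service-mesh policy:

# Toy circuit breaker: after repeated failures, stop calling the downstream
# service for a cool-off period instead of letting threads pile up waiting.
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Fail fast rather than tie up a thread on a sick dependency.
                raise CircuitOpenError("downstream unavailable, failing fast")
            # Cool-off elapsed: let one trial call through.
            self.opened_at = None
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

The fail-fast path is the whole point: the cascading failure described earlier starts with callers waiting on a slow dependency, and a breaker converts that waiting into an immediate, handleable error.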
Before your next service split, ask: do you have distributed tracing deployed and actually used by your engineers? Do you have contract tests in CI? Do you have a documented saga pattern for your cross-service workflows? If the answers are no, you are not ready to split — you are ready to make the existing system harder to operate.
Pick one of those three gaps and close it. Then revisit whether the split is still necessary.