Scalability Is Not a Feature. It Is a Consequence of Good Design.
by Eric Hanson, Backend Developer at Clean Systems Consulting
The Scalability Conversation Is Usually Wrong
The meeting happens once a quarter in most engineering teams. Someone raises "scalability" as a concern, and the conversation immediately goes to horizontal scaling, caching strategies, and database partitioning. What rarely happens is someone asking: "Why is the system struggling with current load, and is that a design problem?"
Scalability is not a property you add to a system after it is built. By the time you are reaching for horizontal scaling levers, you have either outgrown your current tier naturally — which is fine — or you have a design problem that scaling will make worse and more expensive to fix.
A system with synchronous blocking calls between ten internal services does not scale horizontally. It replicates its blocking. A database with missing indexes does not scale with more replicas. It replicates slow queries. A service that holds database connections open for the duration of long-running operations does not benefit from adding more instances. It exhausts connection pools faster.
Scalability follows from design. If the design is sound, scaling is mostly an operational exercise. If the design is flawed, scaling is an expensive way of discovering where the flaws are.
What "Good Design" Actually Means for Scalability
Three design properties determine whether a system will scale predictably:
Statelessness in the hot path. If your application servers hold no session state — no in-memory user sessions, no local file locks, no instance-specific caches — adding instances behind a load balancer is trivially effective. If they do hold state, horizontal scaling requires sticky sessions or state synchronization, both of which add complexity and failure modes.
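A minimal, runnable sketch of the stateless pattern. `SessionStore` is a stand-in for an external store such as Redis; here a plain dict plays that role so the example runs on its own. The class and method names are illustrative, not from any particular framework.

```python
class SessionStore:
    """External session storage shared by all app instances.
    In production this would be Redis or similar; a dict stands in here."""
    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id)

    def put(self, session_id, session):
        self._data[session_id] = session


class AppServer:
    """Stateless application server: no session lives in the instance."""
    def __init__(self, store):
        self.store = store

    def handle(self, session_id):
        # Every request reads and writes session state externally,
        # so any instance can serve any request.
        session = self.store.get(session_id) or {"views": 0}
        session["views"] += 1
        self.store.put(session_id, session)
        return session["views"]


store = SessionStore()
a, b = AppServer(store), AppServer(store)
a.handle("s1")          # first request lands on instance A
count = b.handle("s1")  # next request lands on instance B
print(count)            # instance B sees A's update: prints 2
```

Because neither instance holds the session, the load balancer can route each request anywhere; adding a third instance requires no sticky sessions and no state synchronization.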
Bounded operations. Every operation in the system should have a bounded, predictable cost. A database query that does a full table scan is not bounded — its cost grows with data volume. A queue consumer that processes one message at a time is bounded — its throughput scales with worker count. Systems made of bounded operations can be scaled predictably. Systems with unbounded operations will hit ceilings that horizontal scaling cannot address.
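The distinction can be made concrete with a small sketch. The "table" is a plain list; `fetch_page` is bounded (its cost is capped by the page size, no matter how large the table grows), while `fetch_all` is unbounded (its cost grows with data volume). Function names are illustrative.

```python
table = list(range(10_000))  # stand-in for a database table

def fetch_all(tbl):
    # Unbounded: touches every row, so cost grows with the table.
    return [row for row in tbl]

def fetch_page(tbl, offset, limit=100):
    # Bounded: never returns more than `limit` rows per call,
    # regardless of table size.
    return tbl[offset:offset + limit]

page = fetch_page(table, offset=0)
print(len(page))              # always 100, however big the table gets
print(len(fetch_all(table)))  # 10000 today, more tomorrow
```

A system built from calls like `fetch_page` can be capacity-planned: N workers give roughly N times the throughput. A system built on `fetch_all`-shaped operations gets slower per operation as data grows, and no number of added instances changes that.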
Explicit coupling. Services that communicate synchronously are coupled at runtime. If Service A calls Service B synchronously, Service A's availability and latency are bounded by Service B's. Under load, this coupling amplifies. Latency in Service B queues up threads in Service A. Failures in Service B cascade to Service A. Explicit async coupling — using a queue between services — breaks this dependency and allows each service to scale independently.
# Implicit runtime coupling (synchronous):
def handle_order(order):
    result = inventory_service.reserve(order.items)    # blocks
    billing_service.charge(order.payment)              # blocks
    notification_service.send_confirmation(order)      # blocks
    return result
# If any of these services slow down, the whole endpoint slows down.

# Explicit coupling (async):
def handle_order(order):
    db.save_order(order, status="pending")
    queue.publish("orders.created", order.id)
    return {"status": "accepted"}
# Downstream services consume the queue independently.
# This endpoint is fast and isolated from downstream latency.
Where Scalability Problems Actually Come From
Most systems that struggle under load have one of three root causes:
N+1 query patterns. The application issues one query to get a list of 100 records, then one query per record to fetch associated data — 101 queries for what could have been 2. This does not become visible in development where datasets are small. It becomes visible in production when the list has 10,000 items.
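The pattern is easy to demonstrate with in-memory dicts standing in for tables and a counter standing in for the query log. Both loaders return the same data; they differ only in how many "queries" they issue.

```python
# Stand-in tables: 100 orders, 10 customers.
orders = {i: {"id": i, "customer_id": i % 10} for i in range(100)}
customers = {i: {"id": i, "name": f"customer-{i}"} for i in range(10)}

query_count = 0

def query(result):
    """Stand-in for issuing one database query."""
    global query_count
    query_count += 1
    return result

def load_n_plus_1():
    rows = query(list(orders.values()))       # 1 query for the list
    for row in rows:
        # 1 query per row: the N in N+1.
        row["customer"] = query(customers[row["customer_id"]])
    return rows

def load_batched():
    rows = query(list(orders.values()))       # 1 query for the list
    ids = {row["customer_id"] for row in rows}
    # 1 query fetching all needed customers at once (an IN clause
    # or a JOIN in real SQL).
    by_id = query({i: customers[i] for i in ids})
    for row in rows:
        row["customer"] = by_id[row["customer_id"]]
    return rows

query_count = 0
load_n_plus_1()
n_plus_1_queries = query_count   # 101 queries for 100 rows

query_count = 0
load_batched()
batched_queries = query_count    # 2 queries, same result

print(n_plus_1_queries, batched_queries)
```

The fix is the same in any ORM or raw-SQL codebase: collect the foreign keys first, fetch the associated rows in one batch, and join in memory. Query count becomes constant instead of proportional to result size.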
Missing or incorrect indexes. A query that works fine at 10,000 rows becomes a multi-second operation at 1,000,000 rows without the right index. This is not a scaling problem. It is a design problem that load has made visible.
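The effect is visible in the query plan, not just in timings. A minimal demonstration using SQLite from Python's standard library: the same query is a full scan without an index and an index lookup with one. `EXPLAIN QUERY PLAN` is SQLite-specific; PostgreSQL and MySQL have their own `EXPLAIN` variants with the same purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [(f"user{i}@example.com",) for i in range(1000)],
)

def plan(sql):
    # The fourth column of EXPLAIN QUERY PLAN output is the
    # human-readable plan detail.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT id FROM users WHERE email = 'user500@example.com'"

before = plan(query)   # SCAN: every row is examined
conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = plan(query)    # SEARCH ... USING INDEX: direct lookup

print(before)
print(after)
```

The scan's cost grows linearly with the table; the index lookup's cost grows logarithmically. At 10,000 rows the difference is milliseconds; at 1,000,000 rows it is the difference between a fast endpoint and a timeout.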
Synchronous fan-out. A single user action triggers synchronous calls to multiple downstream services. Each call adds latency. At low concurrency, this is invisible. At high concurrency, thread pools exhaust and requests pile up waiting for threads that are themselves blocked on downstream calls.
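The latency arithmetic can be shown directly. In this sketch, `time.sleep` stands in for a blocking network call; the service names and 50 ms latencies are illustrative. Because the calls are issued one after another, the endpoint's latency is their sum.

```python
import time

def call(service, latency_s):
    time.sleep(latency_s)   # stand-in for a blocking network call
    return f"{service}: ok"

def handle_request():
    # Three synchronous downstream calls, issued sequentially.
    return [
        call("inventory", 0.05),
        call("billing", 0.05),
        call("notifications", 0.05),
    ]

start = time.monotonic()
results = handle_request()
elapsed = time.monotonic() - start
print(f"{elapsed:.2f}s")    # roughly 0.15s: latencies add, they do not overlap
```

Every downstream call also pins the calling thread for its full duration, so throughput is capped at (thread pool size) / (summed latency). Slow any one dependency and that cap drops for the whole endpoint, which is exactly the coupling the queue-based design above avoids.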
None of these are solved by horizontal scaling. They are solved by fixing the design. Adding instances replicates the problem.
The Practical Test
Before reaching for a scaling solution, identify what is actually saturating. Use your APM — Datadog, New Relic, Honeycomb — to find where time is spent in the request path. Is the database the bottleneck? Is it a specific query? Is it a service dependency? Is it lock contention?
Most systems that claim to have a scaling problem have a query problem, a connection pool configuration problem, or a synchronous coupling problem. Fix the design first. Then scale. In that order.
When you fix the design, you will often discover the system handles the current load fine. When you scale first, you run more instances of a broken design and pay more for the privilege.