What Nobody Tells You About Scaling a Backend System
by Eric Hanson, Backend Developer at Clean Systems Consulting
The Scaling Mythology
The industry has built a mythology around scaling. The myth is that it's primarily about infrastructure: more servers, bigger machines, better cloud architecture. The reality is that most systems fail to scale because of application-layer problems that no amount of horizontal scaling can fix. More servers running the same N+1 query pattern just hit the database from more connections simultaneously.
Real scaling work is about understanding limits — where the system runs out of a resource — and eliminating that limit before the next one becomes the bottleneck.
The Resources That Run Out First
Different system components hit different ceilings:
Database connections run out before almost anything else in typical web applications. Each connection consumes memory on both the application and database sides. A PostgreSQL instance with 200 max connections supports roughly 200 concurrent database operations, regardless of how many application servers are running. Scaling from 5 to 50 application servers while each holds 10 idle connections means 500 connections to a 200-connection database — and connection errors at load.
Connection pooling (PgBouncer in transaction mode for PostgreSQL, the HikariCP pool in JVM applications) is the standard response. PgBouncer transaction-mode pooling allows hundreds of application instances to share a small number of database connections by multiplexing at the transaction boundary. At scale, this is not optional.
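On the application side, the same discipline applies: cap the pool per instance so that (instances × pool size) stays within what the database or the PgBouncer front-end can absorb. A minimal HikariCP sketch, with placeholder connection details and sizes that are illustrative rather than recommendations:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolSetup {
    public static HikariDataSource createPool() {
        HikariConfig config = new HikariConfig();
        // Placeholder connection details -- substitute your own.
        config.setJdbcUrl("jdbc:postgresql://db.internal:5432/app");
        config.setUsername("app");
        config.setPassword(System.getenv("DB_PASSWORD"));
        // Keep (instance count x pool size) within the connection budget
        // of the database or the PgBouncer tier in front of it.
        config.setMaximumPoolSize(10);
        config.setMinimumIdle(2);            // don't hold idle connections you don't need
        config.setConnectionTimeout(3_000);  // fail fast (ms) instead of queueing indefinitely
        return new HikariDataSource(config);
    }
}
```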
Thread pools are the mechanism by which blocking I/O limits concurrent requests. In a thread-per-request model, each request waiting for a database response, an HTTP call, or a file read holds a thread. Threads consume memory (~1MB stack by default on JVM). A service configured with 200 threads can handle 200 concurrent blocking operations. Beyond that, requests queue.
The escape from this constraint is reactive or async I/O (Spring WebFlux, Vert.x, Node.js event loop). These models don't dedicate a thread per request — they use non-blocking I/O to handle many concurrent operations on a small thread pool. The tradeoff is programming model complexity: debugging async stack traces is harder, and not all libraries support non-blocking operation.
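To make the contrast concrete, here is a minimal sketch using the JDK's built-in HttpClient: sendAsync returns a CompletableFuture, so many calls can be in flight at once without one blocked thread per call. The URL and request count are made up for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.IntStream;

public class AsyncCalls {
    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();

        // 1,000 concurrent requests, but no thread sits blocked waiting on a response:
        // the client uses non-blocking I/O and completes each future when the reply arrives.
        List<CompletableFuture<HttpResponse<String>>> inFlight = IntStream.range(0, 1_000)
                .mapToObj(i -> HttpRequest.newBuilder(
                        URI.create("https://api.example.internal/items/" + i)).build())
                .map(req -> client.sendAsync(req, HttpResponse.BodyHandlers.ofString()))
                .toList();

        // Wait for everything to finish, only for the sake of the demo.
        CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
        System.out.println("completed " + inFlight.size() + " requests");
    }
}
```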
Memory typically fails in one of two modes: steady-state growth (a memory leak), or per-request allocations that multiply with concurrent load. An endpoint that allocates 100MB to process a large payload supports at most (available memory) / 100MB concurrent requests of that type before running out.
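The arithmetic is worth making explicit: with a hypothetical 8GB heap, an endpoint that buffers a 100MB payload tops out around 80 concurrent requests of that type. The usual fix is to stream the payload through instead of materializing it; a sketch, assuming a handler that receives the body as an InputStream:

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class UploadHandler {
    // Buffers the whole payload: roughly 100MB of heap per in-flight request.
    static void handleBuffered(InputStream body, Path dest) throws Exception {
        byte[] all = body.readAllBytes();   // entire payload held in memory at once
        Files.write(dest, all);
    }

    // Streams in fixed-size chunks: heap usage stays flat regardless of payload size.
    static void handleStreaming(InputStream body, Path dest) throws Exception {
        try (OutputStream out = Files.newOutputStream(dest)) {
            body.transferTo(out);           // small internal buffer, not the full 100MB
        }
    }
}
```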
The Stateless Prerequisite
Horizontal scaling — adding more instances — requires stateless instances. Any state that lives in an instance (in-memory session data, local file uploads, in-process caches of per-user data) creates asymmetry between instances. Requests for a user on instance A behave differently than the same requests on instance B.
The practical path to stateless instances:
- Sessions in Redis (or a database) rather than in process memory
- File uploads to object storage (S3, GCS) rather than local disk
- Configuration from environment variables or a config service rather than files that vary per instance
- Per-request correlation IDs in a distributed tracing system rather than request state on the instance
This is not complex to implement and does not require a major refactor for most services. It does require doing it before you need to scale, not as an emergency response to a traffic spike.
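For the first item on that list, a minimal sketch of session data in Redis using the Jedis client. The key prefix and TTL are illustrative, and most services would reach for something like Spring Session rather than hand-rolling this:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

public class RedisSessionStore {
    private static final int SESSION_TTL_SECONDS = 1800;   // 30 minutes, illustrative
    private final JedisPool pool;

    public RedisSessionStore(JedisPool pool) {
        this.pool = pool;
    }

    public void save(String sessionId, String serializedSession) {
        try (Jedis jedis = pool.getResource()) {
            // SETEX writes the value and its expiry in one call,
            // so abandoned sessions clean themselves up.
            jedis.setex("session:" + sessionId, SESSION_TTL_SECONDS, serializedSession);
        }
    }

    public String load(String sessionId) {
        try (Jedis jedis = pool.getResource()) {
            // Any instance can serve the request, because the session
            // lives in Redis rather than in this process's memory.
            return jedis.get("session:" + sessionId);
        }
    }
}
```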
What Actually Changes at Each Order of Magnitude
The architectural concerns shift as load increases:
~100 req/sec: A single well-tuned application instance, a single database. The bottleneck is usually application logic or database queries. Fix your N+1 queries, add the missing indexes, and you have headroom.
~1,000 req/sec: Horizontal application scaling. Database read replicas for read-heavy workloads. Caching for frequently-read, infrequently-changing data. The bottleneck shifts to the database write path.
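For the "frequently-read, infrequently-changing" case, the usual shape is cache-aside. A sketch with Redis via Jedis; the key naming, TTL, and loader function are placeholders:

```java
import java.util.function.Supplier;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

public class CacheAside {
    private static final int TTL_SECONDS = 300;   // 5 minutes; tune to how stale is acceptable
    private final JedisPool pool;

    public CacheAside(JedisPool pool) {
        this.pool = pool;
    }

    // Read through the cache; fall back to the database loader on a miss.
    public String getProfile(String userId, Supplier<String> loadFromDatabase) {
        String key = "profile:" + userId;
        try (Jedis jedis = pool.getResource()) {
            String cached = jedis.get(key);
            if (cached != null) {
                return cached;                      // cache hit: no database round trip
            }
            String fresh = loadFromDatabase.get();  // cache miss: one database read
            if (fresh != null) {
                jedis.setex(key, TTL_SECONDS, fresh);  // populate for subsequent requests
            }
            return fresh;
        }
    }
}
```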
~10,000 req/sec: Database sharding or partitioning starts becoming relevant. Connection pooling infrastructure (PgBouncer) is essential. CDN for static assets removes a category of load entirely. The bottleneck is often a specific hot database table or a chatty network path.
~100,000 req/sec: Specialized infrastructure at every layer. At this point, most teams have moved past generic advice — the bottlenecks are specific to your data model and access patterns. Twitter's famous move from MySQL to a tiered object store for tweet storage, Slack's migration from a shared Redis cluster to a partitioned one — these are solutions to specific, measured constraints.
The Measurement Imperative
Nobody told you this because it's not a memorable principle: scaling work is 80% measurement and 20% change. Before you touch anything, you need: current throughput and latency percentiles (p50, p95, p99), resource utilization per component (CPU, memory, connections, I/O), and a load test that reproduces production traffic patterns.
k6 and Gatling are both practical choices for load testing. Run them against a staging environment that mirrors production as closely as possible. Establish your baseline. Then make one change at a time and re-measure. This is slower than making five changes at once. It's the only way to know what worked.
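For orientation, a sketch of what a baseline scenario can look like in Gatling's Java DSL (assuming Gatling 3.7 or later); the host, endpoint, rate, and duration are stand-ins that you would replace with your own traffic profile:

```java
import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.ScenarioBuilder;
import io.gatling.javaapi.core.Simulation;
import io.gatling.javaapi.http.HttpProtocolBuilder;
import java.time.Duration;

public class BaselineSimulation extends Simulation {
    // Point at a staging environment that mirrors production; placeholder URL.
    HttpProtocolBuilder httpProtocol = http.baseUrl("https://staging.example.internal");

    // One representative read path; a real baseline replays the production traffic mix.
    ScenarioBuilder browse = scenario("baseline-read")
            .exec(http("get_order").get("/orders/123"));

    {
        setUp(
            browse.injectOpen(constantUsersPerSec(50).during(Duration.ofMinutes(5)))
        ).protocols(httpProtocol);
    }
}
```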
The Practical Takeaway
Before your next scaling effort, answer: what is the specific resource that runs out first under load? You should be able to name it — database connections, thread pool slots, memory, a specific slow query. If you can't name it from measurement, you haven't done the measurement yet. Do that before writing a line of scaling code.