Your System Does Not Need to Scale to a Million Users on Day One
by Eric Hanson, Backend Developer at Clean Systems Consulting
The Hypothetical Million Users
The requirement shows up early, usually from the business side: "We need this to scale to a million users." The engineering team takes it seriously, adds Kubernetes, a service mesh, distributed caching, and a sharded database — all before the first real user has logged in. The system takes four months to build instead of six weeks. The startup runs out of runway. The million users never materialize because the product was never validated.
This is not a hypothetical. It is a common failure mode, and it is driven by a misunderstanding of what scale requires and when.
Designing for a million users when you have zero means making decisions without data. You do not know your read-to-write ratio. You do not know which features users will actually use. You do not know your query patterns. You do not know whether your bottleneck will be compute, I/O, or network. Every architectural decision you make in that vacuum is a guess. Some guesses will be right. Others will create constraints that make the system harder to change once you actually have users and data to learn from.
What You Actually Need at Early Scale
A well-configured PostgreSQL instance on a $400/month cloud VM handles a surprising amount of load. Under typical read-heavy workloads with proper indexing — primary key lookups, indexed foreign key joins, no full table scans — a single PostgreSQL instance can handle thousands of queries per second. With connection pooling via PgBouncer, that number stays stable under high concurrency.
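To make that concrete, here is a minimal sketch of an application pool pointed at PgBouncer, using node-postgres. The hostnames, credentials, and table names are placeholders; PgBouncer conventionally listens on port 6432 in front of Postgres on 5432.

import { Pool } from "pg";

// Placeholder connection details; PgBouncer sits between the app
// and Postgres, so the app connects to it rather than the database.
const pool = new Pool({
  host: "pgbouncer.internal",   // hypothetical PgBouncer host
  port: 6432,                   // PgBouncer's conventional port
  database: "app",
  user: "app_user",
  password: process.env.DB_PASSWORD,
  max: 20, // client-side cap; PgBouncer multiplexes server connections behind it
});

// An indexed primary-key lookup: the kind of query a single
// well-tuned instance serves by the thousands per second.
export async function getUserById(id: number) {
  const { rows } = await pool.query(
    "SELECT id, email FROM users WHERE id = $1",
    [id]
  );
  return rows[0] ?? null;
}

One caveat worth knowing: PgBouncer's transaction-pooling mode does not preserve session state such as named prepared statements, so check your driver's settings against whichever pooling mode you choose.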
A monolithic application deployed behind a load balancer to two or three instances handles most early-stage traffic requirements. Horizontal scaling of stateless application servers is cheap and fast when you need it. The decision to add instances takes minutes, not months.
# Reasonable starting configuration for most applications:
Application:  2x stateless app servers behind a load balancer
Database:     Single PostgreSQL primary (RDS db.t3.large)
              PgBouncer for connection pooling
              Read replica when query load warrants it
Cache:        Redis for session storage and hot-path queries
              Single instance to start
Queue:        SQS, Redis-backed Sidekiq, or a Postgres-backed
              job queue such as pg-boss
              For async work that doesn't belong in the request cycle
This handles tens of thousands of daily active users without drama.
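As an illustration of the queue entry above, here is a minimal pg-boss sketch. It follows the send/work pattern from the pg-boss 9.x API; exact handler signatures differ between major versions, and the connection string, queue name, and payload are placeholders.

import PgBoss from "pg-boss";

const boss = new PgBoss("postgres://app_user:secret@db.internal/app"); // placeholder DSN

async function main() {
  await boss.start();

  // Worker: async work that doesn't belong in the request cycle.
  await boss.work("send-welcome-email", async (job) => {
    console.log("emailing user", job.data);
  });

  // Producer: called from a request handler after signup.
  await boss.send("send-welcome-email", { userId: 42 });
}

main().catch(console.error);

The jobs table lives in the same Postgres instance you already run, which is exactly the point: no new infrastructure until the queue's throughput demands it.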
The key is that this configuration can evolve. Read replicas are additive. A caching layer is additive. Extracting a high-throughput service when you identify it is a well-understood operation. None of these changes require a redesign from scratch.
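To show how additive a read replica is at the application level, here is one hedged sketch: two pools and a routing helper. The hostnames are hypothetical and credentials are assumed to come from the environment.

import { Pool } from "pg";

// Hypothetical hostnames; user and password default from the environment.
const primary = new Pool({ host: "db-primary.internal", database: "app" });
const replica = new Pool({ host: "db-replica.internal", database: "app" });

// Reads that tolerate replication lag go to the replica;
// writes and lag-sensitive reads stay on the primary.
export function db(opts: { readOnly?: boolean } = {}): Pool {
  return opts.readOnly ? replica : primary;
}

// A call site changes one argument, not its architecture:
//   await db({ readOnly: true }).query("SELECT ...");

Call sites opt into the replica one query at a time, which is what makes the change additive rather than a redesign.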
The Cost of Premature Scale
Over-engineering has three costs that are rarely accounted for in the excitement of architecture planning:
Time to first deploy. A Kubernetes cluster with service mesh, distributed tracing, and a sharded database takes weeks to configure correctly. A monolith on managed infrastructure takes days. In the early stages of a product, the ability to ship and learn quickly is the primary competitive advantage. Every week of infrastructure work is a week of not learning whether the product is right.
Operational complexity. Distributed systems require distributed operations. Debugging a slow request across four microservices requires distributed tracing, log correlation across services, and familiarity with network-level failure modes. Debugging the same request in a monolith requires a stack trace. A team that chooses microservices at an early stage takes on the operational burden of a large engineering organization without the staff to manage it.
Change cost. The irony of building for scale early is that it makes changing the system harder. A microservice architecture with established contracts between services is harder to refactor than a monolith when you discover — as you will — that the initial domain model was wrong.
The Honest Growth Model
Start with the simplest system that could work. Define clear thresholds that would trigger a scaling decision: if read query latency exceeds 100ms at p95, evaluate a read replica. If write throughput saturates the primary, evaluate write sharding or a different data model. If a specific feature's traffic grows 10x relative to the rest of the system, evaluate extracting it.
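One way to keep those thresholds honest is to write them down as data the team can review, as in this sketch. The Metrics shape, the function names, and the write-saturation cutoff are assumptions for illustration; the values themselves would come from whatever monitoring you already run (CloudWatch, Prometheus, or similar).

type ScalingTrigger = {
  name: string;
  action: string;
  fired: (m: Metrics) => boolean;
};

interface Metrics {
  p95ReadLatencyMs: number;
  primaryWriteUtilization: number; // 0..1
  featureTrafficRatio: number;     // feature QPS / rest-of-system QPS
}

const triggers: ScalingTrigger[] = [
  {
    name: "read-latency",
    action: "evaluate a read replica",
    fired: (m) => m.p95ReadLatencyMs > 100,
  },
  {
    name: "write-saturation",
    action: "evaluate write sharding or a different data model",
    fired: (m) => m.primaryWriteUtilization > 0.8, // assumed cutoff
  },
  {
    name: "hot-feature",
    action: "evaluate extracting the feature",
    fired: (m) => m.featureTrafficRatio > 10,
  },
];

// Returns the scaling conversations the team owes itself this week.
export function scalingDecisionsDue(m: Metrics): string[] {
  return triggers.filter((t) => t.fired(m)).map((t) => `${t.name}: ${t.action}`);
}

Note that each trigger's action is "evaluate", not "build": crossing a threshold starts a conversation backed by data, not an automatic re-architecture.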
These thresholds keep the system right-sized based on evidence. They also force the team to understand what is actually happening in the system before changing the architecture.
The goal is to be one architectural step ahead of your actual load, not ten steps ahead of your imagined load. When you reach a million users, you will have enough data and revenue to engineer for it correctly. Until then, you are borrowing operational complexity against a future that has not arrived.