Stateless vs Stateful: The Decision That Affects Everything Downstream
by Eric Hanson, Backend Developer at Clean Systems Consulting
The Decision That Gets Made Implicitly
Most teams don't consciously choose between stateless and stateful service design. They make dozens of small decisions — store user session in memory, cache the user profile object on first load, accumulate request metrics in a local counter — and the result is a stateful service that nobody explicitly chose to build. When they try to scale it, the statefulness fights them.
Understanding this distinction before you build is worth the ten minutes it takes.
What Statefulness Actually Means
A stateless service instance treats every request as independent. It does not store information about prior requests. Any instance in a pool can handle any request correctly. Adding instances increases capacity linearly.
A stateful service instance has memory of prior requests, clients, or connections. The correct handling of a request may depend on what happened to this specific instance before. Not all instances are interchangeable.
The distinction is about the instance, not the system. A stateless service can absolutely persist data — it writes to and reads from a database, a cache, or object storage. The key property is that the state lives outside the instance. The instance itself is ephemeral and interchangeable.
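To make the pattern concrete, here is a minimal sketch of a stateless handler pair. All names are illustrative, and `store` stands in for an external store (Redis, a database); a plain dict is used here only so the sketch is self-contained.

```python
# The handler owns no state between requests: everything it needs is
# either in the request itself or fetched from the external store.
def handle_get_profile(store, user_id):
    profile = store.get(f"profile:{user_id}")
    if profile is None:
        return {"status": 404}
    return {"status": 200, "body": profile}

def handle_update_profile(store, user_id, new_profile):
    # Writes go straight to the external store; nothing is cached on
    # the instance, so any replica can serve the next request.
    store[f"profile:{user_id}"] = new_profile
    return {"status": 204}
```

Because the instance holds nothing, a write handled by one replica is immediately visible to a read handled by another: the instances are interchangeable, which is the property that makes horizontal scaling simple.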
Why Statelessness Is the Default for HTTP Services
Stateless HTTP services have a simple scaling story: put a load balancer in front of N identical instances. When load increases, add instances. When load decreases, remove them. Any instance handles any request. This is horizontal scaling in its purest form.
Stateful HTTP services complicate this immediately:
- Load balancer configuration: You need sticky sessions to ensure a client always reaches the same instance. This distributes load unevenly and makes each client's pinned instance a single point of failure for that client.
- Instance replacement: When you deploy a new version, rolling deploys drop old instances. In-memory state in those instances is lost. Sessions break. Operations in progress may fail.
- Debugging: Two instances of the same service may respond differently to the same request if their in-memory state differs. Reproducing production issues requires knowing which instance the request hit.
None of these are unsolvable. They're costs. The question is whether the benefit of in-instance state is worth those costs.
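As one example of the first cost, here is what sticky sessions look like in an nginx configuration (an assumed deployment with three instances; hostnames are illustrative):

```nginx
# ip_hash pins each client IP to one upstream instance -- the "sticky"
# behavior a stateful service forces onto the load balancer. If an
# instance is removed, its clients are rehashed and lose their state.
upstream app_pool {
    ip_hash;
    server app1.internal:8080;
    server app2.internal:8080;
    server app3.internal:8080;
}

server {
    listen 80;
    location / {
        proxy_pass http://app_pool;
    }
}
```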
Where Statefulness Is Justified
Some workloads genuinely require stateful instances:
WebSocket connections: A persistent connection between a client and a server is inherently stateful. The connection lives on one instance. If that instance goes down, the connection is lost. This is managed with reconnection logic on the client side and a message broker for cross-instance fan-out (when you need to broadcast to all clients, regardless of which instance holds the connection).
Stateful protocol implementations: Some protocols, like FTP data connections or certain streaming protocols, maintain connection-level state that can't be externalized cheaply.
High-performance computing: Workloads where the cost of serializing and deserializing state to an external store on every operation would dominate the processing time. An ML inference server that loads a 2GB model into memory on startup is stateful, and appropriately so.
Gaming servers and real-time collaboration: Multiple clients share a live session that requires microsecond-latency coordination. The state needs to be local to the coordinator.
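The broker-based fan-out pattern mentioned for WebSockets can be sketched in a few lines. The broker here is an in-process stand-in so the example runs on its own; in a real deployment it would be Redis pub/sub, NATS, or similar, and the class names are illustrative.

```python
class Broker:
    """Stand-in for a message broker shared by all instances."""
    def __init__(self):
        self.subscribers = []          # one callback per instance

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        for cb in self.subscribers:    # fan out to every instance
            cb(message)

class WebSocketInstance:
    """One server instance holding its own live connections."""
    def __init__(self, broker):
        self.connections = {}          # conn_id -> inbox; this is the state
        broker.subscribe(self.deliver_local)

    def connect(self, conn_id):
        self.connections[conn_id] = []

    def deliver_local(self, message):
        # Deliver only to the connections this instance actually holds.
        for inbox in self.connections.values():
            inbox.append(message)
```

A broadcast published to the broker reaches clients on every instance, even though each connection lives on exactly one of them. The statefulness is contained: it covers the connections themselves, while the broker keeps the instances coordinated.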
The External State Model
When you externalize state from your instances to a shared store, you need to think carefully about:
Consistency: A single Redis node is strongly consistent while it is up, but asynchronous replication means a failover can lose acknowledged writes. Redis Cluster provides horizontal scalability with a partitioned key space but restricts multi-key operations to keys in the same hash slot. For session data, the occasional lost write is usually acceptable. For financial transactions, it is not.
Latency: An external cache round-trip adds ~1ms in a co-located deployment. Under load, with connection pool contention, this can spike significantly. If you're making ten sequential Redis calls per request, that's 10ms of added latency at minimum.
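The arithmetic above is worth writing down, because it also shows the standard mitigation: batching the calls (e.g. a multi-get or pipelining) collapses N sequential round trips into roughly one. A back-of-envelope sketch, with illustrative numbers:

```python
# Sequential calls each pay a full round trip; a batch pays roughly one.
def added_latency_ms(calls, rtt_ms=1.0, batched=False):
    return rtt_ms if batched else calls * rtt_ms

# 10 sequential calls at ~1 ms each -> ~10 ms added per request,
# versus ~1 ms if the same reads go out as a single batch.
```

This is why a stateless design often pushes you toward fetching all per-request state in one or two round trips rather than sprinkling cache calls through the request path.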
Failure modes: When your external state store is unavailable, your instances may be unable to serve requests that require state access. Design the fallback: what does your service do when Redis is down? Fail closed (return errors), fail open (allow requests with reduced functionality), or use local memory as a fallback with staleness tolerance?
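The third option, local memory as a fallback with a staleness bound, can be sketched as a thin wrapper around the store client. The class name, the `max_stale_s` parameter, and the store interface are all assumptions for illustration:

```python
import time

class FallbackCache:
    """Reads go to the external store; each instance keeps a
    last-known-good copy it can serve while the store is down."""
    def __init__(self, store, max_stale_s=60):
        self.store = store              # external store; .get() may raise
        self.max_stale_s = max_stale_s  # how old a fallback value may be
        self.local = {}                 # key -> (value, fetched_at)

    def get(self, key):
        try:
            value = self.store.get(key)
            self.local[key] = (value, time.monotonic())
            return value
        except ConnectionError:
            entry = self.local.get(key)
            if entry and time.monotonic() - entry[1] <= self.max_stale_s:
                return entry[0]         # fail open: serve stale data
            raise                       # fail closed: no acceptable fallback
```

Note the trade-off this encodes: requests for recently seen keys degrade gracefully, while cold keys still fail, and the two instances of a "stateless" service can now briefly disagree. That is acceptable only if you chose the staleness bound deliberately.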
The Architecture Decision Record You Should Write
The stateless vs stateful choice deserves documentation. Write down:
- What state, if any, will live in this service's instances?
- How is that state invalidated or updated when it becomes stale?
- What happens to operations in flight when an instance is replaced?
- How does this design behave when the instance is scaled to zero and restarted?
If you can answer all four cleanly, your design is well-considered. If question three or four produces "the operation is lost," that's a design decision with real user impact — it should be explicit, not implicit.
The Practical Takeaway
For your next service, default to stateless: no in-memory session, no local cache that isn't backed by an external store, no accumulated state that differs between instances. If you have a requirement that seems to demand statefulness, write it down and verify it can't be met with externalized state before accepting the operational complexity of a stateful design.