What Actually Happens When You Put a Load Balancer in Front of Your App
by Eric Hanson, Backend Developer at Clean Systems Consulting
The Architecture Diagram Lie
In architecture diagrams, the load balancer is a rectangle with arrows pointing at a cluster of identical boxes. Traffic comes in, gets distributed, problem solved. It looks mechanical and obvious. In practice, adding a load balancer introduces a set of behavioral changes to your application that you need to understand before they surprise you in production.
This is not a documentation exercise. These are specific, concrete behaviors that cause production incidents for teams that didn't think them through.
Session State Assumptions
The most common surprise: your application stores user session data in memory. User authenticates, session object lives in the application process. Works perfectly with one instance. Add a second instance behind a load balancer with round-robin routing, and the user's next request may land on the other instance. No session. User is logged out. This is not a load balancer bug. It's an application that was never designed for horizontal scaling.
The fix is usually one of:
- Sticky sessions (session affinity in load balancer config): all requests from the same client route to the same instance. Solves the immediate problem, undermines even load distribution, and creates a single point of failure per user session.
- Distributed session storage: move session state out of process into Redis or a database. Stateless instances. Any request can land anywhere. This is the correct architecture for horizontally scaled systems; it's also more complex.
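The round-robin failure mode is easy to reproduce in a few lines of plain Java. This is a sketch, not a real load balancer: the two instances and the round-robin counter are stand-ins for the real thing.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: two app instances, each with its own in-memory session map,
// behind round-robin distribution.
public class SessionDemo {
    static class AppInstance {
        private final Map<String, String> sessions = new HashMap<>();

        void login(String sessionId, String user) {
            sessions.put(sessionId, user);
        }

        // Returns the logged-in user, or null if this instance has no session
        String whoAmI(String sessionId) {
            return sessions.get(sessionId);
        }
    }

    public static void main(String[] args) {
        List<AppInstance> instances = List.of(new AppInstance(), new AppInstance());
        int next = 0;

        // Request 1: user logs in; round-robin sends it to instance 0
        AppInstance first = instances.get(next++ % instances.size());
        first.login("sess-42", "alice");

        // Request 2: same session cookie; round-robin sends it to instance 1
        AppInstance second = instances.get(next++ % instances.size());
        System.out.println("instance 0 sees: " + first.whoAmI("sess-42"));  // alice
        System.out.println("instance 1 sees: " + second.whoAmI("sess-42")); // null -> logged out
    }
}
```

Moving the Map into Redis or a database is exactly the distributed-session fix: both instances would then read the same store and the second request would succeed.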
The load balancer didn't break your application. It exposed a design assumption that was always there.
Health Checks and What "Healthy" Means
Load balancers remove unhealthy instances from rotation. They determine health by polling a health check endpoint you define. This sounds simple but has several non-obvious implications.
Health checks that only verify the process is running are nearly useless. An instance can be running and completely unable to serve traffic — its database connection pool exhausted, its downstream dependencies unreachable, its thread pool saturated. A health check at /health that returns 200 because the HTTP server is alive will keep a broken instance in rotation.
A meaningful health check verifies readiness, not just liveness:
@GetMapping("/health/ready")
public ResponseEntity<Map<String, String>> readiness() {
    Map<String, String> status = new LinkedHashMap<>();
    // Check DB connectivity with a lightweight query
    try {
        jdbcTemplate.queryForObject("SELECT 1", Integer.class);
        status.put("database", "ok");
    } catch (Exception e) {
        status.put("database", "error: " + e.getMessage());
        return ResponseEntity.status(503).body(status);
    }
    // Check the pool can still hand out a connection. getConnection() never
    // returns null; it blocks or throws when the pool is exhausted, so catch
    // the failure and close whatever we borrow.
    try (Connection conn = dataSource.getConnection()) {
        status.put("pool", "ok");
    } catch (SQLException e) {
        status.put("pool", "exhausted: " + e.getMessage());
        return ResponseEntity.status(503).body(status);
    }
    return ResponseEntity.ok(status);
}
The Kubernetes distinction between livenessProbe and readinessProbe exists for exactly this reason: an instance that is alive but not ready should be restarted or pulled from rotation, not treated identically.
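In Kubernetes terms, that distinction looks roughly like the following pod spec fragment (a sketch: the /health/live path, container name, and port are placeholders; /health/ready matches the endpoint above). A failed liveness probe restarts the container; a failed readiness probe only removes the pod from Service endpoints.

```yaml
containers:
  - name: app
    livenessProbe:          # process is dead -> restart the container
      httpGet:
        path: /health/live
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:         # process is up but not ready -> pull from rotation
      httpGet:
        path: /health/ready
        port: 8080
      periodSeconds: 5
      failureThreshold: 2
```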
Connection Draining and In-Flight Requests
When a load balancer removes an instance from rotation — for a deploy, a scale-down, or a health check failure — requests that are already in flight to that instance don't stop. If you kill the instance immediately, those requests fail. Users see errors.
Connection draining (called "deregistration delay" in AWS ALB, configurable in most load balancers) allows a grace period: the instance stops receiving new connections but finishes serving existing ones. Defaults vary: AWS ALB's deregistration delay defaults to 300 seconds, while Kubernetes gives pods a 30-second termination grace period. Whether the window is long enough depends on your longest requests.
For a service where the 99th-percentile request duration is 200ms, 30 seconds is generous. For a service that processes batch jobs that can run for 10 minutes, you need either a much longer drain window or a different strategy for long-running requests.
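Draining also needs application-side cooperation: the instance has to keep serving in-flight requests after it's told to stop. A sketch using the JDK's built-in HttpServer (an illustration, not a production server; the port and 30-second window are placeholders): `stop(delay)` stops accepting new connections, then waits up to `delay` seconds for current exchanges to finish.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

// Sketch of application-side draining with the JDK's built-in HttpServer.
public class DrainingServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/", exchange -> {
            byte[] body = "ok\n".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();

        // On SIGTERM (e.g. during a deploy), drain for up to 30 seconds.
        // Keep this window aligned with the load balancer's drain setting.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> server.stop(30)));
    }
}
```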
Timeouts at Every Layer
A load balancer introduces a new timeout boundary. Most load balancers have their own idle connection timeout (AWS ALB defaults to 60 seconds) and a request timeout. If your backend service takes longer than the load balancer's timeout to respond, the load balancer closes the connection and returns an error to the client — regardless of whether your backend eventually produces a valid response.
This means your application's own timeout settings need to be shorter than the load balancer's timeouts, which need to be shorter than the client's timeouts. Timeout hierarchy:
Client timeout > Load balancer timeout > Application timeout > Downstream timeout
Violating this hierarchy produces symptoms that look like intermittent failures: requests succeed most of the time but fail for some users, with no apparent pattern. The pattern is response time — it's the requests that take longer than the most restrictive timeout in the chain.
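One cheap guard is to assert the hierarchy when the service starts. A sketch, assuming the four values come from your configuration (the names and numbers here are illustrative):

```java
// Illustrative startup-time check that the timeout hierarchy holds:
// client > load balancer > application > downstream.
public class TimeoutCheck {
    static void requireDescending(long clientMs, long lbMs, long appMs, long downstreamMs) {
        if (!(clientMs > lbMs && lbMs > appMs && appMs > downstreamMs)) {
            throw new IllegalStateException(
                "timeout hierarchy violated: client=" + clientMs
                + " lb=" + lbMs + " app=" + appMs + " downstream=" + downstreamMs);
        }
    }

    public static void main(String[] args) {
        // e.g. client 90s > ALB 60s > app 30s > downstream call 10s
        requireDescending(90_000, 60_000, 30_000, 10_000);
        System.out.println("timeout hierarchy ok");
    }
}
```

Failing fast at startup turns the "intermittent failures with no apparent pattern" above into a single obvious configuration error.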
The HTTP vs TCP Layer Choice
Layer 4 load balancers (TCP) route connections without inspecting HTTP. They're fast and simple. They also don't understand HTTP concepts like host headers, URLs, or response codes. You can't do path-based routing, you can't terminate TLS and inspect the decrypted traffic, and you can't route based on request attributes.
Layer 7 load balancers (HTTP) — AWS ALB, nginx, HAProxy in HTTP mode — understand HTTP. They can route /api/* to one backend and /static/* to another, terminate TLS, inject headers, and make routing decisions based on request content. They're slower and more complex. They're also what most applications actually need.
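The routing decision itself is conceptually small. A sketch of the prefix matching a layer 7 balancer performs on request paths (the pool names and the default-pool fallback are invented for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the layer-7 decision a TCP balancer cannot make:
// pick a backend pool from the HTTP request path.
public class PathRouter {
    private final Map<String, String> prefixToPool = new LinkedHashMap<>();

    void route(String prefix, String pool) {
        prefixToPool.put(prefix, pool);
    }

    // First matching prefix wins; fall back to a default pool
    String poolFor(String path) {
        for (Map.Entry<String, String> e : prefixToPool.entrySet()) {
            if (path.startsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return "default-pool";
    }

    public static void main(String[] args) {
        PathRouter r = new PathRouter();
        r.route("/api/", "api-backend");
        r.route("/static/", "cdn-backend");
        System.out.println(r.poolFor("/api/users"));     // api-backend
        System.out.println(r.poolFor("/static/app.js")); // cdn-backend
        System.out.println(r.poolFor("/index.html"));    // default-pool
    }
}
```

This only works because the balancer can see the decrypted HTTP request, which is exactly what a layer 4 balancer gives up.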
The Practical Takeaway
Before adding a load balancer to your stack, audit your application for three things: where session state lives, whether your health check endpoint reflects actual readiness, and what your longest-running requests are. Those three answers determine the majority of load balancer configuration decisions you'll need to make — and the incidents you'll have if you get them wrong.