What Actually Happens to Your System When Traffic Suddenly Spikes
by Arif Ikhsanudin, Backend Developer
The Spike Arrives
Your product gets featured somewhere — a newsletter, a social post, a press article. Traffic goes from 50 requests per second to 800 in under two minutes. Your monitoring shows a response time increase, then a surge of 502 errors, then total unavailability. The spike lasts 20 minutes. The recovery takes 45 minutes. You never get those users back.
The question is not just "why did the system go down." The question is what specifically happened in those two minutes between the spike starting and the system falling over — because that sequence determines what you need to change.
The Failure Cascade
Traffic spikes produce failure through a cascade, not a single event. Understanding the cascade tells you where to intervene.
Stage 1: Thread pool saturation. Your application server has a fixed thread pool — typically 200 threads in a default Tomcat or similar configuration. Each incoming request consumes a thread. When requests arrive faster than they complete, the pool fills. New connections start queuing. Response times climb as requests wait for an available thread.
Stage 2: Database connection pool exhaustion. An application thread that acquires a database connection typically holds it for the duration of the request. As threads queue up and requests slow, connections stay checked out longer. The database connection pool, typically 10–100 connections, is exhausted. Application threads now block waiting for a connection. This makes every request slower, which backs up threads further.
Stage 3: Timeout cascades. Requests that have been waiting too long hit their timeout. The client retries. Now you have the original traffic plus retry traffic hitting an already-saturated system. Retry storms are one of the most common causes of recovery taking longer than the original spike.
Stage 4: Memory pressure and GC pauses. Queued requests are objects in memory. As the queue grows, heap usage climbs. The JVM, or equivalent runtime, spends increasing time in garbage collection. A stop-the-world GC pause halts every application thread. During a pause, no requests complete. When the pause ends, all queued responses attempt to complete simultaneously.
# What happens to a typical Java/Tomcat stack under spike:
t=0s:   50 req/s  | threads: 30/200  | db conns: 20/50
t=30s:  800 req/s | threads: 180/200 | db conns: 48/50  <- near limit
t=60s:  800 req/s | threads: 200/200 | db conns: 50/50  <- SATURATED
                  | queue depth: 400 requests waiting
                  | p50 latency: 4s (was 80ms)
t=90s:  800 req/s | incoming requests timeout-fail immediately
                  | retry storm begins
                  | GC pressure increasing
t=120s: OUTAGE    | 502s or connection refused
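The arithmetic behind Stages 1 and 3 can be sketched with Little's law (average requests in flight = arrival rate × time in system). This is a rough model, not a measurement: the function names are hypothetical, the 800 req/s and 200-thread figures come from the trace above, and the latency, failure fraction, and retry count are assumptions chosen for illustration.

```python
def threads_in_use(arrival_rate_per_s: float, mean_latency_s: float) -> float:
    """Little's law: average number of requests (and so threads) in flight."""
    return arrival_rate_per_s * mean_latency_s


def effective_load(base_rate_per_s: float,
                   failure_fraction: float,
                   retries_per_failure: int) -> float:
    """First-order estimate of original traffic plus client retries."""
    return base_rate_per_s * (1 + failure_fraction * retries_per_failure)


POOL_SIZE = 200    # thread pool size from the trace
SPIKE_RATE = 800   # req/s from the trace

# Mean latency at which the spike fills the whole pool.
saturation_latency_s = POOL_SIZE / SPIKE_RATE  # 0.25 s

# Assumed: half of all requests time out and each one is retried twice.
load_with_retries = effective_load(SPIKE_RATE, failure_fraction=0.5,
                                   retries_per_failure=2)  # 1600 req/s
```

Under these assumptions the pool is full as soon as mean latency passes 250 ms, and the retry storm doubles the load on a system that is already saturated.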
Where Each Intervention Fits
Connection pool sizing: Not as large as possible. A large database connection pool under spike can overwhelm the database itself, which has its own connection limit and query concurrency ceiling. The correct connection pool size is a function of your database's max_connections and how many application instances share it. Over-sizing shifts the bottleneck to the database without solving it.
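As a rough illustration of that sizing rule, the sketch below divides the database's connection budget across application instances. The function name, the reserved-connection allowance, and the example numbers are all assumptions, not recommendations for any particular database.

```python
def pool_size_per_instance(db_max_connections: int,
                           reserved_connections: int,
                           app_instances: int) -> int:
    """Largest per-instance pool that keeps the total under the database limit.

    reserved_connections is a hypothetical allowance for admin sessions,
    migrations, and monitoring that connect outside the application pools.
    """
    if app_instances <= 0:
        raise ValueError("app_instances must be positive")
    usable = db_max_connections - reserved_connections
    if usable < app_instances:
        raise ValueError("not enough connections for one per instance")
    return usable // app_instances


# Assumed numbers: max_connections=200, 20 reserved, 6 application instances.
per_instance = pool_size_per_instance(200, 20, 6)  # 30
```

If the service autoscales, the instance count to plug in is the maximum the autoscaler can reach, not the current one. Otherwise scaling out during a spike pushes the total past the database limit.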
Request timeouts: Set aggressive timeouts on all outbound calls: database queries, external service calls, cache operations. A request that times out at 500ms frees its thread. A request that waits 30 seconds for a database connection holds a thread for 30 seconds and contributes to the cascade. Circuit breakers (Resilience4j, or the older Hystrix, which is now in maintenance mode) wrap this in a policy: after N failures in a window, stop sending requests to the failing dependency and fail fast until it recovers.
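The fail-fast policy can be sketched as a small state machine. This is a simplified, single-threaded illustration that counts consecutive failures rather than failures in a sliding window. It is not how Resilience4j or Hystrix are implemented, and the class and parameter names are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling a dependency that is marked unhealthy."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=10.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock        # injectable so the cool-down can be tested
        self._failures = 0         # consecutive failures seen so far
        self._opened_at = None     # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Open: reject immediately without touching the dependency.
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: let this call through as a trial (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial, or reaching the threshold, (re)opens the circuit.
            if (self._opened_at is not None
                    or self._failures >= self.failure_threshold):
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The wrapped function still needs its own timeout. The breaker only decides whether to attempt the call at all, so a call that hangs forever never registers as a failure.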
Load shedding: When the system is saturated, the right behavior is to reject new requests with a 429 or 503 immediately rather than queue them. A fast rejection is recoverable. A queued request that waits 10 seconds and then fails has consumed thread time, connection time, and memory for no benefit. The load balancer or API gateway is the natural place to enforce this, through rate limits or a cap on concurrent requests, because the rejection happens before the request ever occupies an application thread.
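The same idea can also be applied inside the application as a last line of defense. The sketch below caps in-flight requests with a non-blocking semaphore and rejects the overflow immediately. The class name, the status tuples, and the handler signature are assumptions for illustration, and a real service would hook this into its framework's middleware.

```python
import threading


class LoadShedder:
    """Reject requests immediately once max_in_flight are being processed."""

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, handler):
        # Non-blocking acquire: never queue, either take a slot or reject now.
        if not self._slots.acquire(blocking=False):
            return 503, "overloaded, retry later"
        try:
            return 200, handler()
        finally:
            # Free the slot whether the handler succeeded or raised.
            self._slots.release()
```

The cap would normally sit a little below the thread pool size, so that health checks and rejections themselves still have threads to run on.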
Autoscaling lag: Cloud autoscaling reacts to metrics such as CPU and request rate, and new instances take 2–5 minutes to spin up. The cascade described above completes in about two minutes, so the outage often begins before new instances are healthy. Autoscaling helps with sustained load growth, not sharp spikes. For spikes, you need headroom: run at 40–50% utilization rather than 80%, so there is room to absorb a spike while autoscaling responds.
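The headroom trade-off is simple arithmetic, sketched below under the simplifying assumption that capacity scales linearly with utilization and instance count. The function names and the fleet size are hypothetical.

```python
import math


def absorbable_multiplier(utilization: float) -> float:
    """How many times current traffic fits before reaching 100% utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return 1.0 / utilization


def instances_for_target(current_instances: int,
                         current_utilization: float,
                         target_utilization: float) -> int:
    """Instances needed to carry the same traffic at a lower utilization."""
    return math.ceil(current_instances * current_utilization
                     / target_utilization)


at_80 = absorbable_multiplier(0.8)            # about 1.25x
at_50 = absorbable_multiplier(0.5)            # 2x
fleet = instances_for_target(10, 0.8, 0.5)    # assumed fleet of 10 grows to 16
```

Even at 50% utilization this model absorbs only a 2x increase, far short of the 16x jump from 50 to 800 requests per second in the opening example. Headroom buys time for autoscaling; it does not replace load shedding.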
The Design Implication
Systems that handle spikes gracefully have two properties: they fail fast rather than queue up, and they degrade gracefully rather than collapse completely. Failing fast means setting timeouts everywhere and refusing to hold threads waiting indefinitely. Degrading gracefully means identifying which features can return cached or approximate responses when the system is under pressure, so core functionality continues even when the expensive operations are shedding load.
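One way graceful degradation can look in code is a lookup that serves the last known good value when the system is under pressure or the live call fails. This is a minimal sketch: `fetch_live` and `under_pressure` are hypothetical callables supplied by the application, and a real implementation would also bound the cache and expire old entries.

```python
from typing import Callable, Dict, Tuple


def make_degrading_lookup(
    fetch_live: Callable[[str], str],
    under_pressure: Callable[[], bool],
) -> Callable[[str], Tuple[str, str]]:
    """Return a lookup that falls back to the last good value per key."""
    last_good: Dict[str, str] = {}

    def lookup(key: str) -> Tuple[str, str]:
        if not under_pressure():
            try:
                value = fetch_live(key)
            except Exception:
                pass  # live call failed: fall through to the cached copy
            else:
                last_good[key] = value
                return value, "live"
        if key in last_good:
            # Skip the expensive call and serve a possibly stale answer.
            return last_good[key], "stale"
        raise LookupError(f"no live or cached value for {key!r}")

    return lookup
```

Returning the "live" or "stale" label alongside the value lets the caller decide whether a stale answer is acceptable for that feature.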
Design for the spike before the spike arrives. The window between a spike starting and the cascade completing is measured in seconds to minutes. There is no time to intervene manually.