The Performance Bug That Only Appears Under Real Traffic
by Eric Hanson, Backend Developer at Clean Systems Consulting
Why Staging Misses It
You ran the load test. p99 was 180ms. You shipped. Three days after launch, p99 on the same endpoint climbs to 4 seconds on Monday mornings between 9 and 10am, then recovers.
Staging missed it because staging never had Monday morning. It never had the specific distribution of account ages, the specific mix of users with large data sets versus small ones, the specific cache hit ratio that results from thousands of users with overlapping access patterns. It never had the database's table statistics drifting as production data grows asymmetrically across partitions.
Certain performance bugs are genuinely only reproducible under real production conditions. The goal is not to prevent them from reaching production — that is often impossible. It is to find them fast and have the tooling to diagnose them when they appear.
The Classes of Production-Only Performance Bugs
Data skew. Your benchmark data is uniformly distributed. Production data is not. One tenant has 10,000 records; the median tenant has 40. A query that performs identically for every tenant in staging hits the 10,000-record tenant in production and spends 500ms there. Without per-tenant query profiling, this is nearly invisible until a customer complains.
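If you suspect skew, the distribution is cheap to check directly. A minimal sketch in PostgreSQL, assuming a multi-tenant orders table keyed by customer_id (swap in your own table and tenant column):
-- How uneven is the data? Compare the largest tenants to the median.
SELECT customer_id, count(*) AS records
FROM orders
GROUP BY customer_id
ORDER BY records DESC
LIMIT 10;
-- Median records per tenant, for comparison with the top 10 above
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY records) AS median_records
FROM (SELECT count(*) AS records FROM orders GROUP BY customer_id) t;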
Cache cold starts after deploys. A deploy invalidates the cache. The first minute of traffic after a deploy is uncached. If your uncached response time is 1 second and your cached response time is 10ms, deploys produce a one-minute window where every user sees 100x slower responses. A load test that starts from warm cache will never catch this.
Query plan regression after statistics update. PostgreSQL and MySQL update table statistics periodically, and occasionally this causes the query planner to switch from an index scan to a seq scan for a query that was previously fast. The query was fine yesterday. It is slow today. Nothing in the code changed.
-- The query that was fast yesterday
SELECT * FROM orders
WHERE customer_id = $1
AND status = 'pending'
ORDER BY created_at DESC
LIMIT 10;
-- Check if the plan changed
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders
WHERE customer_id = 12345
AND status = 'pending'
ORDER BY created_at DESC
LIMIT 10;
-- Look for: "Seq Scan" where you expect "Index Scan"
-- Or: a rows estimate that is dramatically wrong (estimate: 5, actual: 50000)
-- Fix: ANALYZE orders; to refresh statistics. SET enable_seqscan = off; is a
-- session-level diagnostic that confirms the index plan is still viable,
-- not a setting to ship to production.
Connection pool saturation during traffic spikes. Your connection pool has 20 connections. Your service has 4 instances. That is 80 concurrent database connections. Under normal traffic, you use 30. During a traffic spike that triples requests, you need 90 connections, but the pool is capped at 80. Requests queue. Latency increases. Nothing is "broken" — the system is simply at capacity in a way that your steady-state tests never reached.
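You can also see saturation from the database side. A sketch for PostgreSQL, counting connections per application instance in pg_stat_activity (the application_name values depend on what your driver or pool sets):
-- How many connections is each service instance actually holding, and what are they doing?
SELECT application_name, state, count(*) AS connections
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY application_name, state
ORDER BY connections DESC;
-- Compare the total against max_connections and against your pools' combined cap
SHOW max_connections;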
The Instrumentation That Makes These Findable
The difference between a production performance incident that takes 30 minutes to diagnose and one that takes 4 hours is almost entirely determined by what you are measuring.
Slow query logging. In PostgreSQL, log_min_duration_statement = 100 logs every query that takes over 100ms. In MySQL, slow_query_log with long_query_time = 0.1 does the same. These logs, aggregated in a tool like pgBadger or sent to your logging infrastructure, will surface the specific queries that are slow on production data.
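Enabling it is a one-line change in either database. A sketch of the settings named above, using the 100ms threshold as a starting point rather than a recommendation:
-- PostgreSQL: log every statement slower than 100ms, then reload the config
ALTER SYSTEM SET log_min_duration_statement = 100;
SELECT pg_reload_conf();
-- MySQL: same idea, threshold expressed in seconds
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 0.1;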
Per-endpoint percentile latency. Average latency per endpoint hides the distribution. If your observability platform (Datadog, Prometheus + Grafana, New Relic) is not reporting p50, p95, and p99 per endpoint, you will see "average latency is fine" while a significant fraction of users are seeing slow responses.
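If request durations also land in the database or a warehouse, you can compute the same percentiles there while the dashboards are being built. A sketch assuming a hypothetical request_log table with endpoint, duration_ms, and requested_at columns:
-- Per-endpoint p50/p95/p99 over the last hour; table and column names are placeholders
SELECT endpoint,
       percentile_cont(0.50) WITHIN GROUP (ORDER BY duration_ms) AS p50,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY duration_ms) AS p95,
       percentile_cont(0.99) WITHIN GROUP (ORDER BY duration_ms) AS p99,
       count(*) AS requests
FROM request_log
WHERE requested_at > now() - interval '1 hour'
GROUP BY endpoint
ORDER BY p99 DESC;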
Connection pool metrics. Most connection pool libraries expose metrics: current pool size, active connections, idle connections, wait time. If these are not in your dashboards, you will not know you have pool exhaustion until latency is already unacceptable.
Query plan change alerts. Some teams instrument their critical queries with query plan checks that run in CI. If the query plan for a critical query changes between releases, the check fails and flags it for review. This is defensive, not reactive.
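One way to implement the check is to have CI run EXPLAIN against a database with representative statistics and fail if the plan's shape changes. A minimal sketch for the orders query above, assuming the CI job can reach such a database:
-- CI step: dump the plan as JSON for the critical query
EXPLAIN (FORMAT JSON)
SELECT * FROM orders
WHERE customer_id = 12345
  AND status = 'pending'
ORDER BY created_at DESC
LIMIT 10;
-- The build script fails the check if the output contains "Seq Scan"
-- (or differs from a stored known-good plan), so the regression is reviewed before release.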
When the Incident Happens Anyway
When a production performance incident occurs and the cause is not immediately obvious, the diagnosis path is:
- Identify the specific endpoints or operations with degraded latency using your p99 dashboards
- Check slow query logs for queries in those operations that are slower than expected
- Check connection pool metrics for saturation
- Check for recent deploys, data migrations, or schema changes in the preceding 48 hours
- Run EXPLAIN ANALYZE on the slowest queries against the production database with real data
The most important thing you can do before a performance incident is ensure your slow query logging is enabled, your percentile dashboards exist, and your connection pool metrics are exported. These three things together will find the cause of 80% of production performance bugs in under 20 minutes. Without them, the same bugs take hours.