The Performance Bug That Only Appears Under Real Traffic

by Eric Hanson, Backend Developer at Clean Systems Consulting

Why Staging Misses It

You ran the load test. p99 was 180ms. You shipped. Three days after launch, p99 on the same endpoint climbs to 4 seconds on Monday mornings between 9 and 10am, then recovers.

Staging missed it because staging never had Monday morning. It never had the specific distribution of account ages, the specific mix of users with large data sets versus small ones, the specific cache hit ratio that results from thousands of users with overlapping access patterns. It never had the database's table statistics drifting as production data grows asymmetrically across partitions.

Certain performance bugs are genuinely only reproducible under real production conditions. The goal is not to prevent them from reaching production — that is often impossible. It is to find them fast and have the tooling to diagnose them when they appear.

The Classes of Production-Only Performance Bugs

Data skew. Your benchmark data is uniformly distributed. Production data is not. One tenant has 10,000 records; the median tenant has 40. A query that performs identically for every tenant in staging hits the 10,000-record tenant in production and spends 500ms. Without per-tenant query profiling, this is nearly invisible until a customer complains.
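A first step is measuring the skew directly. A minimal sketch in PostgreSQL, assuming a multi-tenant table — the `records` table and `tenant_id` column names are illustrative:

```sql
-- Find the tenants whose row counts dwarf the rest.
-- Table and column names are illustrative.
SELECT tenant_id,
       COUNT(*) AS row_count
FROM records
GROUP BY tenant_id
ORDER BY row_count DESC
LIMIT 10;

-- Compare against the median tenant size:
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY row_count) AS median_rows
FROM (
  SELECT COUNT(*) AS row_count
  FROM records
  GROUP BY tenant_id
) per_tenant;
```

If the top tenant is orders of magnitude above the median, that tenant's data shape deserves its own entry in your benchmark fixtures.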

Cache cold starts after deploys. A deploy invalidates the cache. The first minute of traffic after a deploy is uncached. If your uncached response time is 1 second and your cached response time is 10ms, deploys produce a one-minute window where every user sees 100x slower responses. A load test that starts from warm cache will never catch this.
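The same cold-start effect exists one layer down, in the database's own buffer cache after a restart. In PostgreSQL the cumulative hit ratio can be read from `pg_stat_database` — note this is a since-startup average, not a per-minute rate, so it will understate a brief cold window:

```sql
-- Approximate buffer cache hit ratio since the last stats reset.
SELECT datname,
       blks_hit,
       blks_read,
       round(blks_hit::numeric / NULLIF(blks_hit + blks_read, 0), 4) AS hit_ratio
FROM pg_stat_database
WHERE datname = current_database();
```

For the application cache itself, the equivalent metric (hit ratio over time, graphed around deploy markers) comes from your cache layer's own stats.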

Query plan regression after statistics update. PostgreSQL and MySQL update table statistics periodically, and occasionally this causes the query planner to switch from an index scan to a seq scan for a query that was previously fast. The query was fine yesterday. It is slow today. Nothing in the code changed.

-- The query that was fast yesterday
SELECT * FROM orders
WHERE customer_id = $1
  AND status = 'pending'
ORDER BY created_at DESC
LIMIT 10;

-- Check if the plan changed
EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM orders
WHERE customer_id = 12345
  AND status = 'pending'
ORDER BY created_at DESC
LIMIT 10;

-- Look for: "Seq Scan" where you expect "Index Scan"
-- Or: rows estimate dramatically wrong (estimate: 5, actual: 50000)
-- Fix: ANALYZE orders; to refresh the statistics. (SET enable_seqscan = off
-- forces the index plan back, but it is a diagnostic tool, not a durable fix.)

Connection pool saturation during traffic spikes. Your connection pool has 20 connections. Your service has 4 instances. That is 80 concurrent database connections. Under normal traffic, you use 30. During a traffic spike that triples requests, you need 90 connections, but the pool is capped at 80. Requests queue. Latency increases. Nothing is "broken" — the system is simply at capacity in a way that your steady-state tests never reached.
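Whether the pool or the database is the ceiling can be checked from the server side. In PostgreSQL, `pg_stat_activity` shows how many of `max_connections` are in use and in what state:

```sql
-- Current connection usage by state (active, idle, idle in transaction, ...).
SELECT state, COUNT(*) AS connections
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY state;

-- The server-side ceiling all the instance pools together must fit under:
SHOW max_connections;
```

If the sum across all service instances approaches `max_connections`, the spike scenario above has no headroom before requests start queuing.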

The Instrumentation That Makes These Findable

The difference between a production performance incident that takes 30 minutes to diagnose and one that takes 4 hours is almost entirely determined by what you are measuring.

Slow query logging. In PostgreSQL, log_min_duration_statement = 100 logs every query that takes over 100ms. In MySQL, slow_query_log with long_query_time = 0.1 does the same. These logs, aggregated in a tool like pgBadger or sent to your logging infrastructure, will surface the specific queries that are slow on production data.
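In PostgreSQL this can be enabled without a restart (superuser required):

```sql
-- Log every statement slower than 100ms (value is in milliseconds).
ALTER SYSTEM SET log_min_duration_statement = 100;
SELECT pg_reload_conf();

-- MySQL equivalent (long_query_time is in seconds):
-- SET GLOBAL slow_query_log = 'ON';
-- SET GLOBAL long_query_time = 0.1;
```

The threshold is a tradeoff: too low and the log volume itself becomes a cost, too high and you only see the catastrophic outliers.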

Per-endpoint percentile latency. Average latency per endpoint hides the distribution. If your observability platform (Datadog, Prometheus + Grafana, New Relic) is not reporting p50, p95, and p99 per endpoint, you will see "average latency is fine" while a significant fraction of users are seeing slow responses.
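If request durations are already landing in a queryable store, the percentiles can be computed directly. A sketch assuming a `request_log` table with `endpoint`, `duration_ms`, and `logged_at` columns — all illustrative names:

```sql
-- Per-endpoint latency percentiles for the last hour.
SELECT endpoint,
       percentile_cont(0.50) WITHIN GROUP (ORDER BY duration_ms) AS p50,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY duration_ms) AS p95,
       percentile_cont(0.99) WITHIN GROUP (ORDER BY duration_ms) AS p99
FROM request_log
WHERE logged_at > now() - interval '1 hour'
GROUP BY endpoint
ORDER BY p99 DESC;
```

A p50 of 40ms next to a p99 of 3 seconds on the same endpoint is exactly the distribution that an average hides.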

Connection pool metrics. Most connection pool libraries expose metrics: current pool size, active connections, idle connections, wait time. If these are not in your dashboards, you will not know you have pool exhaustion until latency is already unacceptable.
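One server-side signal worth checking alongside the client-side pool metrics: connections stuck in `idle in transaction`, which hold a pool slot while doing no work and are a common cause of exhaustion. In PostgreSQL:

```sql
-- Connections holding an open transaction while idle, oldest first.
SELECT pid,
       now() - state_change AS time_in_state,
       left(query, 60) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY time_in_state DESC;
```

A connection idle in transaction for minutes usually points at application code that opened a transaction and then did slow work (or hung) before committing.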

Query plan change alerts. Some teams instrument their critical queries with query plan checks that run in CI. If the query plan for a critical query changes between releases, the check fails and flags it for review. This is defensive, not reactive.
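A minimal version of such a check can be built on EXPLAIN's machine-readable output. A sketch — the query is the one from the earlier example, and the CI wiring around it is illustrative:

```sql
-- EXPLAIN (FORMAT JSON) produces a parseable plan; a CI job can run this
-- against a production-like database and fail the build if the scan node
-- type changed, e.g. from "Index Scan" to "Seq Scan".
EXPLAIN (FORMAT JSON)
SELECT * FROM orders
WHERE customer_id = 12345
  AND status = 'pending'
ORDER BY created_at DESC
LIMIT 10;

-- The CI script then parses the JSON for "Node Type" on the orders scan
-- and flags the release for review when it differs from the recorded plan.
```

The check is only meaningful if the database it runs against has statistics resembling production's, which is itself an argument for refreshing staging data regularly.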

When the Incident Happens Anyway

When a production performance incident occurs and the cause is not immediately obvious, the diagnosis path is:

  1. Identify the specific endpoints or operations with degraded latency using your p99 dashboards
  2. Check slow query logs for queries in those operations that are slower than expected
  3. Check connection pool metrics for saturation
  4. Check for recent deploys, data migrations, or schema changes in the preceding 48 hours
  5. Run EXPLAIN ANALYZE on the slowest queries against the production database with real data (EXPLAIN ANALYZE actually executes the query, so wrap any write statement in a transaction and roll it back)

The most important thing you can do before a performance incident is ensure your slow query logging is enabled, your percentile dashboards exist, and your connection pool metrics are exported. These three things together will find the cause of 80% of production performance bugs in under 20 minutes. Without them, the same bugs take hours.
