The Query That Works Fine Until It Doesn't
by Eric Hanson, Backend Developer at Clean Systems Consulting
The query that graduates from slow to incident
You've seen this timeline: a query ships, works fine, gets forgotten. Then at some threshold — 1 million rows, a traffic spike, a data import that triples the table size — it becomes the bottleneck behind a P0 incident. You look at the query and nothing about it obviously screams "slow." It looks like any other query. The problem is structural.
Certain query patterns are time bombs. They're not wrong. They're not buggy. They just depend on data volume staying small, and that assumption is never written down. Here's how to recognize and defuse them before they detonate.
Pattern 1: The unbounded aggregation
SELECT user_id, SUM(amount) AS lifetime_value
FROM transactions
GROUP BY user_id;
This aggregates every row in transactions every time it runs. At 100k rows, 10ms. At 100M rows, called from a dashboard that auto-refreshes every 30 seconds — production is down.
The fix is pre-computation. Maintain a user_lifetime_value table updated incrementally via triggers or an event-driven pipeline. The query becomes:
SELECT user_id, lifetime_value FROM user_lifetime_value WHERE user_id = :id;
Alternatively, use materialized views (PostgreSQL, Oracle) with a refresh schedule that matches your tolerance for staleness:
CREATE MATERIALIZED VIEW user_ltv AS
SELECT user_id, SUM(amount) AS lifetime_value
FROM transactions
GROUP BY user_id;
-- Refresh on a schedule, not per request
REFRESH MATERIALIZED VIEW CONCURRENTLY user_ltv;
The CONCURRENTLY flag in PostgreSQL allows reads during refresh; without it, the view takes an exclusive lock and reads block until the refresh completes. Note that CONCURRENTLY requires at least one unique index on the materialized view.
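The incremental approach can be sketched end to end. This is a minimal illustration using SQLite in memory, with a trigger standing in for the trigger or event-driven pipeline described above; the table and column names follow the article, everything else (trigger name, sample data) is made up for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (user_id INTEGER, amount REAL);
    CREATE TABLE user_lifetime_value (
        user_id INTEGER PRIMARY KEY,
        lifetime_value REAL NOT NULL DEFAULT 0
    );
    -- Keep the rollup in sync on every insert, so reads never aggregate
    CREATE TRIGGER trg_maintain_ltv AFTER INSERT ON transactions
    BEGIN
        INSERT OR IGNORE INTO user_lifetime_value (user_id) VALUES (NEW.user_id);
        UPDATE user_lifetime_value
        SET lifetime_value = lifetime_value + NEW.amount
        WHERE user_id = NEW.user_id;
    END;
""")

conn.executemany("INSERT INTO transactions (user_id, amount) VALUES (?, ?)",
                 [(1, 10.0), (1, 5.0), (2, 7.5)])

# The read path is now a single-row lookup, not a full-table aggregation
ltv = conn.execute(
    "SELECT lifetime_value FROM user_lifetime_value WHERE user_id = ?", (1,)
).fetchone()[0]
print(ltv)  # 15.0
```

The write path pays a small constant cost per insert; the read path no longer depends on the size of transactions at all.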
Pattern 2: The offset that gets deeper over time
-- Page 1: fast
SELECT * FROM events ORDER BY created_at DESC LIMIT 50 OFFSET 0;
-- Page 10,001: slow, because it reads and discards 500,000 rows first
SELECT * FROM events ORDER BY created_at DESC LIMIT 50 OFFSET 500000;
OFFSET tells the database to skip N rows. To skip them, it has to read them first. At deep offsets this is a full or near-full table scan regardless of your LIMIT. This degrades predictably as your table grows and as users paginate deeper.
Keyset pagination eliminates this:
-- First page
SELECT id, title, created_at FROM events ORDER BY created_at DESC, id DESC LIMIT 50;
-- Next page (pass last seen values from previous page)
SELECT id, title, created_at FROM events
WHERE (created_at, id) < (:last_created_at, :last_id)
ORDER BY created_at DESC, id DESC
LIMIT 50;
Requires a composite index on (created_at DESC, id DESC). Each page is then a single index seek, so performance stays flat no matter how deep the user paginates.
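Here is the keyset pattern as a runnable sketch, using SQLite in memory (row-value comparisons like `(created_at, id) < (?, ?)` need SQLite 3.15+). The table and column names follow the article; the page size, index name, and sample data are invented for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, title TEXT, created_at TEXT)")
conn.execute("CREATE INDEX idx_events_keyset ON events (created_at DESC, id DESC)")
conn.executemany(
    "INSERT INTO events (id, title, created_at) VALUES (?, ?, ?)",
    [(i, f"event {i}", f"2024-01-{(i % 28) + 1:02d}") for i in range(1, 201)],
)

def fetch_page(last_row=None, size=5):
    """Return the next page; last_row is the final row of the previous page."""
    if last_row is None:
        return conn.execute(
            "SELECT id, title, created_at FROM events "
            "ORDER BY created_at DESC, id DESC LIMIT ?", (size,)).fetchall()
    last_id, _, last_created_at = last_row
    return conn.execute(
        "SELECT id, title, created_at FROM events "
        "WHERE (created_at, id) < (?, ?) "
        "ORDER BY created_at DESC, id DESC LIMIT ?",
        (last_created_at, last_id, size)).fetchall()

first = fetch_page()
second = fetch_page(first[-1])

# Same rows as OFFSET pagination, but via an index seek instead of a skip-scan
offset_second = conn.execute(
    "SELECT id, title, created_at FROM events "
    "ORDER BY created_at DESC, id DESC LIMIT 5 OFFSET 5").fetchall()
print(second == offset_second)  # True
```

The id tiebreaker matters: created_at alone is not unique, and without a total ordering, rows at a page boundary can be skipped or duplicated.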
Pattern 3: The filter that relies on function application
SELECT * FROM orders WHERE EXTRACT(YEAR FROM created_at) = 2024;
Wrapping a column in a function defeats any plain index on that column. The database has to compute EXTRACT(YEAR FROM created_at) for every single row before it can apply the filter. (An expression index on that exact function would work, but a range predicate is usually simpler and more flexible.)
-- Index-friendly equivalent
SELECT * FROM orders
WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01';
If you have an index on created_at, this uses it. The first form does not. This pattern shows up constantly with date functions, type casts, and string operations:
-- Index-hostile
WHERE LOWER(email) = 'user@example.com'
WHERE CAST(user_id AS TEXT) = '42'
WHERE DATE(created_at) = CURRENT_DATE
-- Index-friendly alternatives
WHERE email = LOWER('User@Example.com') -- normalize at write time
WHERE user_id = 42 -- fix the type mismatch upstream
WHERE created_at >= CURRENT_DATE AND created_at < CURRENT_DATE + INTERVAL '1 day'
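You can watch the planner make this choice. The sketch below uses SQLite, which has no EXTRACT, so strftime stands in for it; the shape of the problem is identical. The table name follows the article, and the index name and sample data are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.execute("CREATE INDEX idx_orders_created ON orders (created_at)")
conn.executemany("INSERT INTO orders (created_at) VALUES (?)",
                 [("2023-06-15",), ("2024-03-01",), ("2024-11-30",), ("2025-01-02",)])

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the human-readable plan text in the last column
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

hostile = "SELECT * FROM orders WHERE strftime('%Y', created_at) = '2024'"
friendly = ("SELECT * FROM orders "
            "WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01'")

print(plan(hostile))   # typically a full SCAN of orders
print(plan(friendly))  # a SEARCH using idx_orders_created

# Both forms return the same rows; only the access path differs
rows_hostile = conn.execute(hostile).fetchall()
rows_friendly = conn.execute(friendly).fetchall()
```

In PostgreSQL the equivalent check is EXPLAIN, where the hostile form shows a Seq Scan and the range form an Index Scan (given enough rows for the planner to prefer the index).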
Pattern 4: The JOIN that multiplies rows unexpectedly
SELECT u.id, u.name, t.tag_name
FROM users u
JOIN user_tags ut ON u.id = ut.user_id
JOIN tags t ON ut.tag_id = t.id
WHERE u.status = 'active';
If a user has 50 tags, they appear 50 times in the result. If the calling code treats each row as a distinct user, you now have a bug that silently corrupts aggregations. And if you add a GROUP BY to fix it without understanding the join, you might accidentally aggregate across users.
Always trace the cardinality through your join chain before writing the query. Ask: for each row in the outer table, how many rows will the inner join produce? If the answer is "potentially many," decide upfront whether you want the fan-out or need to aggregate it away.
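The fan-out is easy to reproduce at small scale. This sketch uses SQLite in memory with the article's schema; one active user with three tags appears three times in the joined result, and a naive COUNT over the join counts tags, not users. The sample names and tag values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, status TEXT);
    CREATE TABLE tags (id INTEGER PRIMARY KEY, tag_name TEXT);
    CREATE TABLE user_tags (user_id INTEGER, tag_id INTEGER);
    INSERT INTO users VALUES (1, 'ada', 'active');
    INSERT INTO tags VALUES (1, 'admin'), (2, 'beta'), (3, 'vip');
    INSERT INTO user_tags VALUES (1, 1), (1, 2), (1, 3);
""")

rows = conn.execute("""
    SELECT u.id, u.name, t.tag_name
    FROM users u
    JOIN user_tags ut ON u.id = ut.user_id
    JOIN tags t ON ut.tag_id = t.id
    WHERE u.status = 'active'
""").fetchall()
print(len(rows))  # 3 -- one user, three result rows

# Counting joined rows counts tag associations; COUNT(DISTINCT ...) counts users
naive = conn.execute(
    "SELECT COUNT(*) FROM users u JOIN user_tags ut ON u.id = ut.user_id"
).fetchone()[0]
correct = conn.execute(
    "SELECT COUNT(DISTINCT u.id) FROM users u JOIN user_tags ut ON u.id = ut.user_id"
).fetchone()[0]
print(naive, correct)  # 3 1
```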
Pattern 5: The query with no LIMIT in a loop
# Called once per API request, returns all matching records
def get_user_events(user_id):
    return db.query("SELECT * FROM events WHERE user_id = ?", user_id)
For most users, this returns 20 rows. For power users with years of history, it returns 50,000. Your application wasn't designed to handle 50,000 rows per request, your web server's memory wasn't allocated for it, and your API client certainly didn't expect it.
Always LIMIT queries that run in request contexts. Even if you "know" the user will have few records, protect against the outlier with an explicit ceiling.
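One defensive version looks like this. It is a sketch using SQLite in memory; MAX_EVENTS is a hypothetical per-request ceiling, and fetching limit + 1 rows is one common trick that lets the caller detect truncation and surface a "has more" flag instead of silently dropping data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER)")
# One power user with 120 events, one typical user with 3
conn.executemany("INSERT INTO events (user_id) VALUES (?)",
                 [(7,)] * 120 + [(8,)] * 3)

MAX_EVENTS = 100  # hypothetical ceiling for a single request

def get_user_events(user_id, limit=MAX_EVENTS):
    # Fetch one extra row so we can tell "exactly limit" from "truncated"
    rows = conn.execute(
        "SELECT id FROM events WHERE user_id = ? ORDER BY id LIMIT ?",
        (user_id, limit + 1),
    ).fetchall()
    truncated = len(rows) > limit
    return rows[:limit], truncated

power_events, power_truncated = get_user_events(7)
print(len(power_events), power_truncated)  # 100 True
small_events, small_truncated = get_user_events(8)
print(len(small_events), small_truncated)  # 3 False
```

The truncated flag is what turns a silent data loss into an explicit pagination signal the API client can act on.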
The discipline
Before shipping any query that touches a table that grows over time, ask: what happens to this query when the table is 100x larger? If the answer is "it gets 100x slower," that's a structural problem you need to solve now, not at incident time. The cost of thinking through this at review time is zero. The cost of an unplanned scale incident is not.