Stop Guessing Why Your Query Is Slow. Use EXPLAIN.
by Eric Hanson, Backend Developer at Clean Systems Consulting
The optimization that didn't help
A query takes 6 seconds. A developer looks at it, decides the `users` join is the problem, and adds an index on `users.email`. The query still takes 6 seconds. They add another index. Still slow. They escalate.
An hour later, someone runs EXPLAIN ANALYZE and discovers the bottleneck is a sort on 2 million rows that's spilling to disk because work_mem is set to 4MB. The indexes were irrelevant. The problem was never the join.
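One command would have shown the setting up front; 4MB happens to be PostgreSQL's shipped default:
SHOW work_mem;  -- per-operation sort/hash memory; PostgreSQL's default is 4MB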
This is the standard pattern when teams optimize without EXPLAIN. They act on intuition and get lucky or waste time. EXPLAIN takes 10 seconds to run and tells you exactly what the database is doing. Use it first, always.
EXPLAIN vs EXPLAIN ANALYZE: the critical difference
-- EXPLAIN: shows the plan the optimizer would use, WITHOUT executing the query
EXPLAIN SELECT * FROM orders WHERE user_id = 42;
-- EXPLAIN ANALYZE: actually executes the query and shows real timings and row counts
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 42;
EXPLAIN alone shows estimated costs and rows. It does not tell you how long the query actually took or how many rows were actually returned. For diagnosis, this is almost never enough.
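Here's the shape of plain EXPLAIN output for the query above (the plan and numbers are illustrative, and the index name is hypothetical). Note there are no actual times anywhere:
Index Scan using orders_user_id_idx on orders  (cost=0.43..112.65 rows=28 width=96)
  Index Cond: (user_id = 42)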
EXPLAIN ANALYZE executes the query. For a slow SELECT, this means you're running the slow query twice — once to diagnose. That's fine and necessary. For destructive queries (UPDATE, DELETE), wrap in a transaction you roll back:
BEGIN;
EXPLAIN ANALYZE DELETE FROM orders WHERE created_at < '2023-01-01';
ROLLBACK;
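One caveat: the DELETE really executes inside that transaction, so it holds row locks and fires triggers until the ROLLBACK. On a busy production table, do this during a quiet window.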
The full diagnostic command
For PostgreSQL, this is what you should run:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT
    o.id,
    u.name,
    SUM(oi.quantity * oi.unit_price) AS total
FROM orders o
JOIN users u ON o.user_id = u.id
JOIN order_items oi ON o.id = oi.order_id
WHERE o.created_at > NOW() - INTERVAL '30 days'
  AND o.status = 'completed'
GROUP BY o.id, u.name
ORDER BY total DESC;
The BUFFERS option adds buffer cache statistics — how many pages were read from cache (shared hit) vs disk (shared read). High shared reads on a hot query are a sign of cache pressure.
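With BUFFERS, each node in the plan gains a line like this (numbers illustrative):
Buffers: shared hit=10240 read=89216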
For MySQL:
EXPLAIN FORMAT=JSON
SELECT ...;
MySQL's JSON format is more information-dense than the default tabular output. Look for "access_type": "ALL" (full table scan) and "rows_examined_per_scan" for each table.
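And if you're on MySQL 8.0.18 or later, EXPLAIN ANALYZE exists there too: like the PostgreSQL version, it executes the query and reports actual timings and row counts in a tree format:
EXPLAIN ANALYZE
SELECT ...;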
Reading the output: the one-minute version
Start at the bottom of the plan (the leaf nodes — where data is read) and work up:
Sort (actual time=3821.43..3894.23 rows=289341 width=72)
  Sort Key: (sum((oi.quantity * oi.unit_price))) DESC
  Sort Method: external merge  Disk: 28640kB          ← SPILLING TO DISK
  ->  HashAggregate (actual time=2341.12..2589.44 rows=289341)
        ->  Hash Join (actual time=234.33..1923.44 rows=1247891)
              ->  Seq Scan on order_items oi
              ->  Hash
                    ->  Index Scan on orders o
                          Filter: (status = 'completed' AND created_at > ...)
                          Rows Removed by Filter: 1823441  ← POOR SELECTIVITY
Two immediate findings:
- `Sort Method: external merge Disk: 28640kB` — the sort is spilling 28MB to disk. Fix: increase `work_mem` for this session (`SET work_mem = '256MB'` before the query).
- `Rows Removed by Filter: 1823441` on the index scan — the index on `orders` is being used, but after following index entries, 1.8 million rows are being discarded by the filter. The index might not be selective enough for this query; a composite index covering `(status, created_at)` would reduce the rows fetched from the heap. Both fixes are sketched below.
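Both are one-liners. A sketch, with an illustrative index name (check pg_indexes first so you don't duplicate an existing index):
SET work_mem = '256MB';  -- session-scoped; leaves the server-wide default alone
-- re-run the query: Sort Method should change from "external merge" to an in-memory "quicksort"

CREATE INDEX orders_status_created_idx ON orders (status, created_at);
-- on a live table, prefer CREATE INDEX CONCURRENTLY to avoid blocking writes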
What each finding points to
| Finding | Likely cause | Fix |
|---|---|---|
| Seq Scan on large table | Missing index, or function on column | Add index, rewrite filter |
| Row estimate far from actual rows | Stale statistics | ANALYZE <table> |
| External merge disk | Insufficient work_mem | Increase work_mem |
| Nested Loop + large outer | Wrong join type chosen | Check statistics, consider enable_nestloop = off temporarily to test |
| High shared read vs hit | Working set not in buffer pool | Increase shared_buffers, add caching layer |
| Index scan with many filtered rows | Index not selective enough | Composite index or partial index |
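Two of those fixes are cheap to test interactively before you commit to anything:
ANALYZE orders;             -- refresh planner statistics for one table
SET enable_nestloop = off;  -- session-only: test whether a different join type wins
-- ... re-run EXPLAIN ANALYZE and compare plans ...
RESET enable_nestloop;      -- don't leave this off outside the experiment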
The habit to build
Before you write an index migration, before you refactor a query, before you escalate a slow query report — run EXPLAIN (ANALYZE, BUFFERS). Read the actual row counts. Find the node with the highest actual time. Fix that node. Run the plan again to confirm improvement. This process, done consistently, catches 90% of query performance problems at the root cause, not at the symptom.
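To make the habit stick beyond one person, PostgreSQL's auto_explain module can log the plan of any query that crosses a latency threshold, so the evidence is already in the logs when the slow-query report arrives. A minimal sketch; the 500ms threshold is an arbitrary starting point:
# postgresql.conf
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '500ms'  # log the plan of anything slower than this
auto_explain.log_analyze = on            # include actual timings (adds measurement overhead)
auto_explain.log_buffers = on            # include BUFFERS statistics (requires log_analyze)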