Stop Writing Subqueries When a JOIN Will Do
by Eric Hanson, Backend Developer at Clean Systems Consulting
When the intuitive approach is the slow one
You need to find all users who have placed an order in the last 30 days. You write:
SELECT * FROM users
WHERE id IN (
SELECT user_id FROM orders WHERE created_at > NOW() - INTERVAL '30 days'
);
It returns correct results. Code review passes. Six months later, with 5 million users and 20 million orders, it's running for 8 seconds. Rewritten as a JOIN, it runs in 80ms. The logic is identical. The execution plan is not.
This pattern — reaching for a subquery when a JOIN is the right tool — is one of the most common SQL performance mistakes I see in production codebases. It's not universal; subqueries have legitimate uses. But developers who learned SQL through intuition rather than execution model tend to overuse them.
What the database actually does
A subquery in a WHERE ... IN (...) clause can be executed in two ways by the query optimizer:
- Execute once, hash the results, probe the hash for each outer row — essentially a hash join. Modern optimizers (PostgreSQL 7.4+, MySQL 8.0+) will often do this automatically.
- Execute the subquery once per outer row — a correlated execution model. This is what happens with correlated subqueries and some uncorrelated ones on older planners.
The second model is catastrophic at scale. If your outer query returns 1 million rows, the subquery runs 1 million times.
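The two shapes are logically equivalent, and you can verify that end-to-end. A minimal sketch using Python's sqlite3 (schema and data are illustrative, and sqlite's planner differs from PostgreSQL's and MySQL's — this checks equivalence of results, not of plans):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT);
    INSERT INTO users VALUES (1, 'ana'), (2, 'bo'), (3, 'cy');
    INSERT INTO orders VALUES (10, 1, '2024-06-01'), (11, 1, '2024-06-02'),
                              (12, 3, '2024-01-01');
""")

cutoff = '2024-05-01'

# Subquery form: filter users by membership in the set of recent order user_ids
in_form = conn.execute(
    "SELECT id FROM users WHERE id IN "
    "(SELECT user_id FROM orders WHERE created_at > ?) ORDER BY id",
    (cutoff,),
).fetchall()

# JOIN form: same logical result, with the relationship made explicit
join_form = conn.execute(
    "SELECT DISTINCT u.id FROM users u "
    "JOIN orders o ON u.id = o.user_id WHERE o.created_at > ? ORDER BY u.id",
    (cutoff,),
).fetchall()

# Only user 1 has an order after the cutoff
assert in_form == join_form == [(1,)]
```

Result equivalence is the precondition for any rewrite like this; the performance difference is a separate question you answer with EXPLAIN on your actual engine.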
The JOIN approach:
SELECT DISTINCT u.*
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.created_at > NOW() - INTERVAL '30 days';
Modern optimizers will execute this as a hash join or merge join — fundamentally the same execution as the optimized subquery form. But the JOIN makes the relationship explicit and gives the optimizer more freedom to choose the best strategy.
WHERE to use subqueries vs JOINs
Use a JOIN when:
- You're filtering based on existence of related rows
- You need columns from the related table in the SELECT
- You're doing aggregations that combine data from multiple tables
Use a subquery (specifically a correlated subquery) when:
- You need a value that depends on each outer row individually
- You're using EXISTS with an early-exit condition
- The subquery result can't be expressed as a flat join
-- EXISTS with correlated subquery is appropriate and efficient
-- The database stops scanning inner rows as soon as one match is found
SELECT u.id, u.name
FROM users u
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.user_id = u.id
AND o.status = 'completed'
AND o.created_at > NOW() - INTERVAL '30 days'
);
EXISTS is often faster than IN with a subquery for this pattern because it short-circuits on the first match. A JOIN with DISTINCT achieves the same result but reads all matching rows before deduplicating.
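The equivalence between the EXISTS form and the JOIN-with-DISTINCT form can be sanity-checked the same way. A small sqlite3 sketch (names and data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (user_id INTEGER, status TEXT);
    INSERT INTO users VALUES (1, 'ana'), (2, 'bo');
    -- user 1 has several completed orders; EXISTS only needs to find one
    INSERT INTO orders VALUES (1, 'completed'), (1, 'completed'),
                              (1, 'completed'), (2, 'pending');
""")

exists_form = conn.execute("""
    SELECT u.id, u.name FROM users u
    WHERE EXISTS (SELECT 1 FROM orders o
                  WHERE o.user_id = u.id AND o.status = 'completed')
""").fetchall()

join_form = conn.execute("""
    SELECT DISTINCT u.id, u.name FROM users u
    JOIN orders o ON o.user_id = u.id
    WHERE o.status = 'completed'
""").fetchall()

# Same rows either way; the difference is how much work each form does to get there
assert exists_form == join_form == [(1, 'ana')]
```

With three matching orders for user 1, the EXISTS form can stop at the first one, while the JOIN reads all three and deduplicates afterward.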
The correlated subquery trap
This is the form that truly destroys performance:
-- Executes the subquery once per row in orders
SELECT
o.id,
o.total,
(SELECT name FROM users WHERE id = o.user_id) AS user_name
FROM orders o;
At 1 million orders, this is 1 million SELECT statements against the users table. Even with an index on users.id, the overhead of 1 million separate index lookups is significant. The JOIN version:
SELECT
o.id,
o.total,
u.name AS user_name
FROM orders o
JOIN users u ON o.user_id = u.id;
This does one join operation across both tables, not N separate lookups. On a dataset of 1 million orders with an indexed user_id, the JOIN version will outperform the correlated subquery by an order of magnitude or more in most engines.
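Before committing to the rewrite, confirm the two forms produce identical rows. A minimal sqlite3 sketch (illustrative schema; sqlite may flatten the correlated subquery itself, which is fine — the point here is result equivalence, and timing belongs to EXPLAIN ANALYZE on your real engine):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'ana'), (2, 'bo');
    INSERT INTO orders VALUES (10, 1, 9.5), (11, 2, 20.0);
""")

# Correlated scalar subquery: conceptually one lookup per order row
correlated = conn.execute("""
    SELECT o.id, o.total,
           (SELECT name FROM users WHERE id = o.user_id) AS user_name
    FROM orders o ORDER BY o.id
""").fetchall()

# JOIN form: one join operation across both tables
joined = conn.execute("""
    SELECT o.id, o.total, u.name AS user_name
    FROM orders o JOIN users u ON o.user_id = u.id
    ORDER BY o.id
""").fetchall()

assert correlated == joined == [(10, 9.5, 'ana'), (11, 20.0, 'bo')]
```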
When subqueries are legitimately better
Scalar subqueries for derived values:
SELECT
p.name,
p.price,
(SELECT AVG(price) FROM products WHERE category_id = p.category_id) AS category_avg_price
FROM products p;
This is a correlated subquery, so it carries the per-row execution cost described above. If what you need is the category average alongside each row, a window function is usually the better tool:
SELECT
name,
price,
AVG(price) OVER (PARTITION BY category_id) AS category_avg_price
FROM products;
Subqueries to limit what gets joined:
-- Pre-filter orders before joining to reduce join size
SELECT u.name, recent.total
FROM users u
JOIN (
SELECT user_id, SUM(total) AS total
FROM orders
WHERE created_at > NOW() - INTERVAL '90 days'
GROUP BY user_id
) recent ON u.id = recent.user_id;
This is a derived table (inline view), not a correlated subquery. It aggregates before joining, which can significantly reduce the number of rows the join has to process.
The practical test
When you write a subquery, ask two questions: Is this correlated (does the subquery reference the outer query)? And does the subquery return a set of IDs you're filtering by? If yes to either, check whether a JOIN or EXISTS produces the same result. Then run EXPLAIN ANALYZE on both and compare the actual row counts and execution time. Don't trust the optimizer to always make the right choice — verify it.
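EXPLAIN ANALYZE is PostgreSQL syntax; other engines expose the same idea under different names. For a quick local illustration, sqlite's EXPLAIN QUERY PLAN reports the planner's chosen strategy for a query (a minimal sketch, schema names illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (user_id INTEGER, created_at TEXT);
    CREATE INDEX idx_orders_user ON orders(user_id);
""")

# Ask the planner how it intends to execute the IN-subquery form
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM users
    WHERE id IN (SELECT user_id FROM orders WHERE created_at > '2024-01-01')
""").fetchall()

for row in plan:
    print(row)  # each row is one step of the chosen plan

assert len(plan) > 0  # the planner always reports at least one step
```

The habit is the same regardless of engine: read the plan for both forms, compare row counts and timings, and let the measurement, not intuition, decide.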