Your SQL Query Works. But It Won't When Your Data Grows.

by Arif Ikhsanudin, Backend Developer

The query that passes review and kills production

You ship a feature. The query runs in under 100ms in staging. Six months later, ops is paging you at 2am because that same query is holding a full table lock on 40 million rows. This is not a hypothetical. It is one of the most predictable failure patterns in backend development, and it keeps happening because developers test with small datasets and reason about correctness instead of cost.

The problem isn't that your SQL is wrong. It's that SQL's cost structure is non-linear. A query that does a full table scan on 10,000 rows costs almost nothing. The same query on 10,000,000 rows is a disaster. And the relationship between row count and query time is not always proportional — it can be exponential once you introduce joins, sorts, and missing indexes.

What actually changes when data grows

Three things kill query performance at scale, and they compound each other:

1. Full table scans become unacceptable

-- Looks innocent. Runs fine at 50k rows. Brutal at 50M.
SELECT * FROM orders WHERE status = 'pending' AND created_at > '2024-01-01';

Without a composite index on (status, created_at), the database reads every row. At 50 million rows on a table with 20 columns, that's potentially gigabytes of I/O for a query that returns 200 rows.

2. JOIN cardinality blows up

SELECT u.name, COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.id;

If users has 1M rows and orders has 50M, the database has to assemble a 50M-row intermediate result before grouping. The query plan matters enormously here — a hash join behaves differently than a nested loop join, and the optimizer won't always pick correctly without proper statistics and indexes.

3. Implicit sorts destroy throughput

ORDER BY, GROUP BY, DISTINCT, and window functions all require sorting unless the data is already ordered on disk (via a clustered index or covering index). Sorting 10M rows in a temp buffer is slow. Sorting 10M rows that spill to disk is catastrophic.

The patterns that age poorly

SELECT * in application queries

-- Don't do this in application code
SELECT * FROM products WHERE category_id = 42;

You're pulling every column — including description TEXT, metadata JSONB, and image_data BYTEA — for a list view that needs only id, name, and price. At 10 rows this is fine. At 10,000 rows per page load, you're moving megabytes of data you immediately discard.

LIKE with a leading wildcard

-- Index on `email` is useless here
SELECT * FROM users WHERE email LIKE '%@example.com';

Any index on email is bypassed entirely. If you need suffix search, use a reverse index or a full-text search engine like PostgreSQL's tsvector / GIN indexes.

Correlated subqueries in SELECT

SELECT
  o.id,
  (SELECT COUNT(*) FROM order_items WHERE order_id = o.id) AS item_count
FROM orders o;

This executes the subquery once per row in orders. At 1M orders, that's 1M additional queries. Rewrite it as a JOIN with aggregation or use a window function.

How to write for scale from the start

Qualify your WHERE clauses to use indexes

Write your query, then ask: does each filter column have an index, and does the filter allow the database to use it? Leading wildcards, function calls on indexed columns (WHERE YEAR(created_at) = 2024), and implicit type coercions all defeat indexes.

-- Index-hostile
WHERE YEAR(created_at) = 2024

-- Index-friendly
WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01'

Select only what you need

Enumerate columns explicitly in application queries. Not just for performance — it also protects you when columns are added or removed.

Paginate aggressively

Never return unbounded result sets from application queries. Use LIMIT and OFFSET, or better, keyset pagination (also called cursor-based pagination) which doesn't degrade at deep offsets:

-- Offset pagination degrades at large offsets (reads and discards N rows)
SELECT * FROM events ORDER BY id LIMIT 100 OFFSET 50000;

-- Keyset pagination is O(1) regardless of position
SELECT * FROM events WHERE id > :last_seen_id ORDER BY id LIMIT 100;

Test with production-scale data

This is the only real fix. Seed your staging environment with realistic data volumes. Use EXPLAIN ANALYZE (PostgreSQL) or EXPLAIN FORMAT=JSON (MySQL) against that data, not against a 5,000-row dev fixture.

What to do today

Pick one query in your application that touches a table expected to grow — orders, events, logs, user activity. Run EXPLAIN ANALYZE on it against your largest dataset. Look at the rows estimate vs actual, and check whether it's doing a Seq Scan where you'd expect an Index Scan. If the estimates are way off, your table statistics are stale — run ANALYZE (PostgreSQL) or ANALYZE TABLE (MySQL) and check again. That single habit, applied regularly, will catch the majority of scale problems before they reach production.

Scale Your Backend - Need an Experienced Backend Developer?

We provide backend engineers who join your team as contractors to help build, improve, and scale your backend systems.

We focus on clean backend design, clear documentation, and systems that remain reliable as products grow. Our goal is to strengthen your team and deliver backend systems that are easy to operate and maintain.

We work from our own development environments and support teams across US, EU, and APAC timezones. Our workflow emphasizes documentation and asynchronous collaboration to keep development efficient and focused.

  • Production Backend Experience. Experience building and maintaining backend systems, APIs, and databases used in production.
  • Scalable Architecture. Design backend systems that stay reliable as your product and traffic grow.
  • Contractor Friendly. Flexible engagement for short projects, long-term support, or extra help during releases.
  • Focus on Backend Reliability. Improve API performance, database stability, and overall backend reliability.
  • Documentation-Driven Development. Development guided by clear documentation so teams stay aligned and work efficiently.
  • Domain-Driven Design. Design backend systems around real business processes and product needs.

Tell us about your project

Our offices

  • Copenhagen
    1 Carlsberg Gate
    1260, København, Denmark
  • Magelang
    12 Jalan Bligo
    56485, Magelang, Indonesia

More articles

API Versioning Is Not Optional Once You Have Real Users

Once an API has real consumers, every change becomes a contract risk. Versioning is the only reliable way to evolve safely without breaking production systems.

Read more

The Difference Between a Fast Test Suite and a Useful Test Suite

Optimizing test suite speed without examining test value produces a fast suite that catches nothing important. Speed and usefulness are both necessary — but they require different investments and are often in tension.

Read more

Recovering From a Public Mistake (Like a Website Crash)

Seeing your website go down in front of everyone is a stomach-dropping moment. But a public mistake doesn’t have to be a career-ender—it can be a chance to show professionalism and resilience.

Read more

Caching Docker Layers in CI/CD to Stop Waiting Forever

CI pipelines rebuild Docker images from scratch because they start with a clean environment every run. Registry-based layer caching gives your pipeline the same cache hits you get locally, often cutting build time by 70% or more.

Read more