The SQL Mistake That Looks Correct But Returns Wrong Data

by Eric Hanson, Backend Developer at Clean Systems Consulting

The query that passes review and corrupts your reports

A billing report has been running in production for eight months. Finance notices the totals are slightly off. Someone looks at the query, spots a LEFT JOIN that's multiplying rows before the SUM, and realizes the revenue figures have been overstated by 12-15% depending on the month. The query looked right. The results looked plausible. Nobody caught it until the numbers were audited.

SQL correctness bugs are worse than performance bugs because they're silent. A slow query is obvious. A query that returns wrong data can run for months before anyone notices — especially for aggregations where "plausible" is not the same as "correct."

Mistake 1: Aggregating after a multiplying JOIN

This is the most common SQL correctness bug in reporting queries:

-- Orders table: one row per order
-- Order items table: multiple rows per order
-- Tags table: multiple rows per order (order can have multiple tags)

SELECT
  o.id,
  SUM(oi.amount) AS order_total
FROM orders o
JOIN order_items oi ON o.id = oi.order_id
JOIN order_tags ot ON o.id = ot.order_id   -- This multiplies rows!
WHERE o.created_at >= '2024-01-01'
GROUP BY o.id;

If an order has 3 items and 4 tags, the join produces 12 rows for that order (3 × 4). SUM(oi.amount) adds each item amount 4 times — once per tag. The total is 4x too large.

The fix: aggregate before joining, or use subqueries to pre-aggregate:

-- Aggregate items before joining to tags
WITH order_totals AS (
  SELECT order_id, SUM(amount) AS total
  FROM order_items
  GROUP BY order_id
)
SELECT
  o.id,
  ot_agg.total AS order_total,
  ARRAY_AGG(t.name) AS tags
FROM orders o
JOIN order_totals ot_agg ON o.id = ot_agg.order_id
LEFT JOIN order_tags otag ON o.id = otag.order_id
LEFT JOIN tags t ON otag.tag_id = t.id
WHERE o.created_at >= '2024-01-01'
GROUP BY o.id, ot_agg.total;

Whenever you have a one-to-many JOIN combined with an aggregation, trace the cardinality before writing the GROUP BY. Ask: how many rows does each join produce per parent row? If the answer is "many," you need to aggregate before or after the join, not during.

Mistake 2: INNER JOIN silently excluding records

-- Intended: total revenue for all users, including users with no orders
SELECT u.id, u.name, SUM(o.total) AS revenue
FROM users u
JOIN orders o ON u.id = o.user_id   -- INNER JOIN excludes users with zero orders
GROUP BY u.id, u.name;

Users with no orders are dropped from the result. If you're building a "revenue by user including $0" report, this is silently wrong. The fix is a LEFT JOIN:

SELECT u.id, u.name, COALESCE(SUM(o.total), 0) AS revenue
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name;

COALESCE(SUM(o.total), 0) converts the NULL that SUM returns for an all-NULL group into 0.

The bug is subtle because the query still returns correct data for users who have orders — the error only manifests for users with no orders. If all your test users have orders (as they usually do in dev/staging), you'll never see it.

Mistake 3: WHERE filters that should be ON conditions in outer joins

-- Intended: all orders, with discount info where available (discount is optional)
SELECT o.id, o.total, d.discount_amount
FROM orders o
LEFT JOIN discounts d ON o.id = d.order_id
WHERE d.campaign_id = 5;   -- This converts the LEFT JOIN to an INNER JOIN!

The LEFT JOIN returns NULL for d columns when no discount exists. But the WHERE clause d.campaign_id = 5 filters out rows where d.campaign_id IS NULL — which includes all rows with no discount. The result is identical to an INNER JOIN.

When filtering on a LEFT-joined table's columns, the filter belongs in the ON clause, not WHERE:

-- Only bring in discounts from campaign 5 (still returns all orders)
SELECT o.id, o.total, d.discount_amount
FROM orders o
LEFT JOIN discounts d ON o.id = d.order_id AND d.campaign_id = 5;

Now orders without a campaign-5 discount still appear, with NULL for discount_amount.

Mistake 4: DISTINCT as a band-aid for a bad join

-- Slow, and masks the real problem
SELECT DISTINCT u.id, u.name
FROM users u
JOIN user_roles ur ON u.id = ur.user_id
JOIN permissions p ON ur.role_id = p.role_id
WHERE p.name = 'admin';

If DISTINCT is the only thing preventing duplicate users, there's a fan-out in the join (a user with multiple roles, each with the 'admin' permission, appears multiple times). DISTINCT hides this, but it also does a full sort or hash deduplication on the entire result set.

The explicit fix:

-- Use EXISTS or a subquery to check permission without fan-out
SELECT u.id, u.name
FROM users u
WHERE EXISTS (
  SELECT 1 FROM user_roles ur
  JOIN permissions p ON ur.role_id = p.role_id
  WHERE ur.user_id = u.id AND p.name = 'admin'
);

EXISTS short-circuits on the first match — no fan-out, no deduplication needed.

Mistake 5: COUNT(*) vs COUNT(column) confusion

SELECT COUNT(*) AS total, COUNT(cancelled_at) AS cancelled_count
FROM orders;

COUNT(*) counts all rows including NULLs. COUNT(column) counts only rows where that column is NOT NULL. If cancelled_at is NULL for non-cancelled orders, COUNT(cancelled_at) correctly counts only cancelled orders. This is often the intended behavior — but developers who expect COUNT(column) to work like COUNT(*) get confused when the numbers differ.

The explicit and readable equivalent:

SELECT
  COUNT(*) AS total_orders,
  COUNT(*) FILTER (WHERE cancelled_at IS NOT NULL) AS cancelled_orders,
  COUNT(*) FILTER (WHERE cancelled_at IS NULL) AS active_orders
FROM orders;

The verification habit

For any reporting query, generate a small test dataset where the expected result is manually calculable. Run the query. Compare. Pay special attention to edge cases: records with no children in a JOIN, records that match multiple join conditions, NULL values in filtered columns. If the query involves aggregation over multiple joined tables, trace the row cardinality through every join step before trusting the output.

Scale Your Backend - Need an Experienced Backend Developer?

We provide backend engineers who join your team as contractors to help build, improve, and scale your backend systems.

We focus on clean backend design, clear documentation, and systems that remain reliable as products grow. Our goal is to strengthen your team and deliver backend systems that are easy to operate and maintain.

We work from our own development environments and support teams across US, EU, and APAC timezones. Our workflow emphasizes documentation and asynchronous collaboration to keep development efficient and focused.

  • Production Backend Experience. Experience building and maintaining backend systems, APIs, and databases used in production.
  • Scalable Architecture. Design backend systems that stay reliable as your product and traffic grow.
  • Contractor Friendly. Flexible engagement for short projects, long-term support, or extra help during releases.
  • Focus on Backend Reliability. Improve API performance, database stability, and overall backend reliability.
  • Documentation-Driven Development. Development guided by clear documentation so teams stay aligned and work efficiently.
  • Domain-Driven Design. Design backend systems around real business processes and product needs.

Tell us about your project

Our offices

  • Copenhagen
    1 Carlsberg Gate
    1260, København, Denmark
  • Magelang
    12 Jalan Bligo
    56485, Magelang, Indonesia

More articles

How Much Does a Backend Contractor Actually Cost vs a Full-Time Hire — A Brutally Honest Breakdown

The day rate looks expensive. The full-time salary looks cheaper. Neither comparison is complete until you account for what each model actually costs to produce a shipped feature.

Read more

A Good API Is One Developers Never Have to Ask Questions About

APIs fail when they require interpretation instead of execution. The best APIs eliminate ambiguity through consistent design, predictable behavior, and self-evident contracts.

Read more

Caching Strategies Compared — In-Memory, Redis, and CDN: When to Use Each

Caching is not a single tool — in-memory, Redis, and CDN caches have different invalidation models, latency profiles, and failure modes that determine where each belongs in your stack.

Read more

How to Write a Pull Request That People Actually Want to Review

A well-written PR description reduces review time, improves feedback quality, and gets merged faster — not because reviewers are lazy, but because good context makes good review possible.

Read more