Stop Writing Loops When SQL Aggregations Can Do the Work
by Eric Hanson, Backend Developer at Clean Systems Consulting
The loop that shouldn't exist
You're reading through a service that calculates monthly revenue by product category. The code fetches all orders for a date range (potentially millions of rows), loads them into memory, and iterates through them in a nested loop: for each category, sum all orders in that category. It works. It uses 2GB of memory on a large dataset and takes 45 seconds. A single SQL aggregation query does the same thing in 300ms.
Aggregating in application code instead of in SQL is one of the most common performance mistakes in data-heavy systems. It's not a subtle mistake — it's 100-1000x slower at scale, uses unbounded memory, and produces code that's harder to read than the equivalent SQL. The reason it persists is that developers reach for Python loops or Java streams before asking whether the database can do it.
What the database can do that your loop can't
A SQL aggregation runs inside the database engine, close to the data. It:
- Reads only the columns it needs, not entire rows
- Applies filters before loading any data into memory (versus loading rows then filtering in code)
- Uses specialized sort and hash algorithms optimized for aggregation at scale
- Can use partial aggregation — aggregate per-partition, then combine — exploiting available parallelism
- Returns one row per group instead of every underlying row
A loop in Python, Java, or Ruby:
- Transfers all matching rows across the network to the application server
- Holds all rows in application memory simultaneously
- Uses general-purpose data structures (hash maps, lists) for grouping
- Serializes/deserializes every row (type conversion overhead)
The performance difference at 10 million rows is not marginal. It's an order of magnitude or more.
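To see the partial, parallel aggregation from the list above in action, ask the planner directly. A minimal sketch, assuming PostgreSQL and the orders table from the examples below; the exact plan shape depends on table size, statistics, and configuration:

-- Ask PostgreSQL how it plans the aggregation. On a table large
-- enough to trigger parallelism, look for Partial Aggregate nodes
-- under a Gather, combined by a Finalize Aggregate node: per-worker
-- partial aggregation followed by a merge step.
EXPLAIN
SELECT category_id, SUM(total) AS revenue
FROM orders
GROUP BY category_id;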
The pattern to replace
# Wrong: fetch all, aggregate in Python
def monthly_revenue_by_category(start_date, end_date):
    orders = db.query(
        "SELECT category_id, total FROM orders WHERE created_at BETWEEN %s AND %s",
        start_date, end_date
    )
    revenue = {}
    for order in orders:
        category = order['category_id']
        revenue[category] = revenue.get(category, 0) + order['total']
    return revenue
-- Right: aggregate in the database
SELECT
    category_id,
    SUM(total) AS revenue
FROM orders
WHERE created_at BETWEEN :start_date AND :end_date
GROUP BY category_id;
The SQL version transfers one row per category instead of one row per order. If there are 5 million orders across 100 categories, you're transferring 100 rows instead of 5 million. This is the primary win.
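For completeness, here is the earlier Python function rewritten to push the work down. A sketch against the same hypothetical db wrapper as the original; parameter-binding style varies by driver:

# Right: let the database aggregate; one row per category comes back
def monthly_revenue_by_category(start_date, end_date):
    rows = db.query(
        """
        SELECT category_id, SUM(total) AS revenue
        FROM orders
        WHERE created_at BETWEEN %s AND %s
        GROUP BY category_id
        """,
        start_date, end_date
    )
    return {row['category_id']: row['revenue'] for row in rows}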
Common aggregation functions you might be underusing
Beyond SUM and COUNT:
-- Revenue stats per category in one query
SELECT
    category_id,
    COUNT(*) AS order_count,
    SUM(total) AS total_revenue,
    AVG(total) AS avg_order_value,
    MIN(total) AS min_order,
    MAX(total) AS max_order,
    PERCENTILE_CONT(0.5)
        WITHIN GROUP (ORDER BY total) AS median_order,  -- PostgreSQL
    STDDEV(total) AS revenue_stddev
FROM orders
WHERE created_at >= NOW() - INTERVAL '30 days'
GROUP BY category_id;
This returns distribution statistics per category in a single query. The application-side equivalent means loading every order, grouping rows manually, and computing the median per group with statistics.median(), which has to sort each group's values.
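To make the contrast concrete, here is a sketch of just the median calculation done application-side, against the same hypothetical db wrapper. Every matching order row crosses the network before any math happens:

import statistics
from collections import defaultdict

# Wrong: per-category median computed in application code.
# Every matching order row is transferred and held in memory.
def median_order_by_category():
    rows = db.query(
        "SELECT category_id, total FROM orders "
        "WHERE created_at >= NOW() - INTERVAL '30 days'"
    )
    totals = defaultdict(list)
    for row in rows:
        totals[row['category_id']].append(row['total'])
    # statistics.median sorts each group's values internally
    return {cat: statistics.median(vals) for cat, vals in totals.items()}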
Conditional aggregation: the CASE WHEN inside COUNT
You need active vs inactive user counts per plan. The naive approach: two queries. The SQL approach: one query with conditional aggregation.
-- Two queries (wrong)
SELECT plan_id, COUNT(*) FROM users WHERE status = 'active' GROUP BY plan_id;
SELECT plan_id, COUNT(*) FROM users WHERE status = 'inactive' GROUP BY plan_id;
-- One query with conditional aggregation (right)
SELECT
    plan_id,
    COUNT(*) FILTER (WHERE status = 'active') AS active_users,    -- PostgreSQL
    COUNT(*) FILTER (WHERE status = 'inactive') AS inactive_users,
    COUNT(*) AS total_users
FROM users
GROUP BY plan_id;

-- ANSI SQL equivalent using CASE WHEN (works in MySQL, SQL Server)
SELECT
    plan_id,
    COUNT(CASE WHEN status = 'active' THEN 1 END) AS active_users,
    COUNT(CASE WHEN status = 'inactive' THEN 1 END) AS inactive_users,
    COUNT(*) AS total_users
FROM users
GROUP BY plan_id;
One table scan, one pass, one result set. Two separate queries scan the table twice.
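The trick is not limited to COUNT. Any aggregate accepts a filter, so you can split revenue in one pass too. A sketch assuming a hypothetical boolean refunded column on orders:

-- Conditional SUM: split revenue without a second scan
SELECT
    category_id,
    SUM(total) AS gross_revenue,
    SUM(total) FILTER (WHERE refunded) AS refunded_revenue,    -- PostgreSQL
    SUM(total) FILTER (WHERE NOT refunded) AS kept_revenue
FROM orders
GROUP BY category_id;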
When to aggregate in application code
Aggregation in the database is not always the answer:
- Complex business logic that can't be expressed in SQL: multi-step calculations with conditional branching based on intermediate results. Push this to the application, but fetch pre-aggregated data from the database (see the sketch after this list).
- Cross-database aggregation: combining data from multiple database systems (e.g., PostgreSQL + Redis + a third-party API). Aggregate at the application tier after fetching from each source.
- Real-time streaming aggregations: for aggregations over a live event stream, a stream processor (Apache Flink, Kafka Streams) is more appropriate than a database query.
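Here is the shape of the first case: the database does the heavy grouping, and the application applies branching rules to the small pre-aggregated result. A sketch with hypothetical commission tiers:

# Hybrid: SQL does the heavy grouping, Python applies business rules
def category_commission_report():
    # One small row per category, not millions of order rows
    rows = db.query(
        "SELECT category_id, SUM(total) AS revenue, COUNT(*) AS n "
        "FROM orders GROUP BY category_id"
    )
    report = {}
    for row in rows:
        # Multi-step branching kept in application code
        if row['n'] < 10:
            rate = 0.0           # too few orders: no commission
        elif row['revenue'] > 1_000_000:
            rate = 0.05          # high-volume tier
        else:
            rate = 0.02
        report[row['category_id']] = row['revenue'] * rate
    return report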
The default should be SQL aggregation. Application-side aggregation should be a conscious choice with a clear reason — not the path of least resistance.
The immediate test
Look at your codebase for any loop that iterates over database query results and accumulates values (sum, count, group, max, min). Each of those loops is a candidate for replacement with a SQL aggregation. Pick the one on the largest dataset and benchmark both approaches. The result will make the case for the rest.
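A minimal timing harness for that benchmark, assuming the loop and SQL versions of the earlier function exist side by side (the _loop/_sql names are placeholders for your own candidates):

import time

def benchmark(fn, *args, runs=3):
    # Best-of-N wall-clock time; rough, but decisive at this scale
    best = float('inf')
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

loop_s = benchmark(monthly_revenue_by_category_loop, start_date, end_date)
sql_s = benchmark(monthly_revenue_by_category_sql, start_date, end_date)
print(f"loop: {loop_s:.2f}s  sql: {sql_s:.2f}s  speedup: {loop_s / sql_s:.0f}x")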