Aggregations Are Not Just COUNT and SUM. There Is More.
by Arif Ikhsanudin, Backend Developer
The Python script that SQL could replace
Your data team has a Python script that pulls revenue data, computes percentiles, builds histograms, calculates correlation coefficients, and aggregates tags into comma-separated strings. It runs for 8 minutes because it loads 10 million rows into pandas. Most of what this script does can be done directly in PostgreSQL, in a fraction of the time, with no data transfer.
Developers default to COUNT, SUM, AVG, MIN, MAX because those are the functions in every SQL tutorial. Modern databases — particularly PostgreSQL — have a much richer aggregation library.
Statistical aggregations
PostgreSQL provides a full suite of statistical functions usable as aggregates:
SELECT
product_category,
COUNT(*) AS order_count,
AVG(order_value) AS mean_order_value,
STDDEV(order_value) AS stddev,
VARIANCE(order_value) AS variance,
-- Skewness and excess kurtosis (PostgreSQL-specific)
-- via extension or custom aggregates
-- Correlation between order value and quantity
CORR(order_value, quantity) AS value_qty_correlation,
-- Covariance
COVAR_POP(order_value, quantity) AS covariance,
-- Regression: slope and intercept of order_value vs quantity
REGR_SLOPE(order_value, quantity) AS regression_slope,
REGR_INTERCEPT(order_value, quantity) AS regression_intercept,
REGR_R2(order_value, quantity) AS r_squared
FROM orders
WHERE created_at >= NOW() - INTERVAL '90 days'
GROUP BY product_category;
CORR, COVAR_POP, and REGR_* functions implement standard statistical calculations over grouped data. They're computed in a single pass over the table — no need to load data into application memory.
Percentile and ordered-set aggregates
Standard AVG doesn't tell you about distribution. Median and percentiles do:
SELECT
product_category,
-- Exact percentiles (sorts all values — can be memory-intensive)
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY order_value) AS p25,
PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY order_value) AS median,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY order_value) AS p75,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY order_value) AS p95,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY order_value) AS p99,
-- Discrete percentile (returns an actual data value, not interpolated)
PERCENTILE_DISC(0.50) WITHIN GROUP (ORDER BY order_value) AS median_actual
FROM orders
GROUP BY product_category;
PERCENTILE_CONT interpolates between values. PERCENTILE_DISC returns the nearest actual data value. Both use WITHIN GROUP (ORDER BY ...) — the "ordered-set aggregate" syntax that specifies how to sort the values for percentile calculation.
For approximate percentiles on very large datasets (hundreds of millions of rows), PostgreSQL's percentile_cont requires sorting all values, which is expensive. Extensions like pg_tdigest provide approximate percentile calculation in O(n) time and constant memory.
MODE: most frequent value
SELECT
product_category,
MODE() WITHIN GROUP (ORDER BY payment_method) AS most_common_payment
FROM orders
GROUP BY product_category;
MODE() returns the most frequently occurring value in the ordered set. Equivalent to a GROUP BY subquery with ORDER BY count DESC LIMIT 1 per group, but expressed as a single aggregate.
String aggregation
Aggregating strings (collecting values from multiple rows into a single delimited string) is one of the most common requirements that developers implement in application code unnecessarily:
-- PostgreSQL: STRING_AGG
SELECT
order_id,
STRING_AGG(product_name, ', ' ORDER BY product_name) AS products_ordered
FROM order_items
JOIN products ON order_items.product_id = products.id
GROUP BY order_id;
-- MySQL: GROUP_CONCAT (with ORDER BY and SEPARATOR)
SELECT
order_id,
GROUP_CONCAT(product_name ORDER BY product_name SEPARATOR ', ') AS products_ordered
FROM order_items
JOIN products ON order_items.product_id = products.id
GROUP BY order_id;
This returns one row per order with a comma-separated list of product names. Without STRING_AGG, you'd fetch all order items and loop in application code.
Array aggregation (PostgreSQL)
For a more structured result that the application can process without parsing delimited strings:
SELECT
order_id,
ARRAY_AGG(product_id ORDER BY product_id) AS product_ids,
ARRAY_AGG(product_name ORDER BY product_name) AS product_names
FROM order_items
JOIN products ON order_items.product_id = products.id
GROUP BY order_id;
Returns PostgreSQL arrays, which ORM drivers typically map to native arrays in the application language.
JSON aggregation
When the result needs to be structured data:
SELECT
order_id,
JSON_AGG(
JSON_BUILD_OBJECT(
'product_id', oi.product_id,
'product_name', p.name,
'quantity', oi.quantity,
'unit_price', oi.unit_price
)
ORDER BY p.name
) AS line_items
FROM order_items oi
JOIN products p ON oi.product_id = p.id
GROUP BY order_id;
Returns one row per order with a JSON array of line items. This is an effective pattern for fetching nested data in one query rather than N+1 queries.
The practical test
Look at any post-processing code in your application that: computes percentiles or medians, builds comma-separated or JSON-serialized strings from multiple rows, calculates correlation or regression, or assembles nested structures from flat query results. Each of these is a candidate for a SQL aggregate. Test it with EXPLAIN ANALYZE to verify the query plan is sane, then benchmark against the application-side equivalent.