Stop Storing Everything in One Table. Normalization Exists for a Reason.
by Eric Hanson, Backend Developer at Clean Systems Consulting
The single-table trap
You're building an invoicing system. Time pressure is real, so you create one table:
CREATE TABLE invoices (
id BIGINT PRIMARY KEY,
invoice_number VARCHAR(50),
customer_name VARCHAR(255),
customer_email VARCHAR(255),
customer_address TEXT,
item_description TEXT,
item_quantity INT,
item_unit_price DECIMAL(10,2),
item_total DECIMAL(10,2),
invoice_date DATE,
due_date DATE,
status VARCHAR(50)
);
This works. Until the customer changes their email and you have to update 400 invoice rows. Or until you realize that one invoice has multiple line items, and you're storing comma-separated descriptions in item_description. Or until you need to query "all invoices with at least one item over $100" and you're doing substring parsing in SQL.
These are update anomalies and structural limitations — the exact problems normalization was designed to prevent.
What normalization actually means
Normalization is the process of organizing a schema so that each fact is stored in exactly one place. It's defined by "normal forms" — each higher form eliminates a specific class of anomaly.
First Normal Form (1NF): Each column contains a single atomic value. No repeating groups, no comma-separated lists.
-- Violates 1NF: multiple values in one column
item_descriptions TEXT -- "Widget A, Widget B, Widget C"
-- Satisfies 1NF: one row per item
CREATE TABLE invoice_items (
id BIGINT PRIMARY KEY,
invoice_id BIGINT REFERENCES invoices(id),
description TEXT,
quantity INT,
unit_price DECIMAL(10,2)
);
Second Normal Form (2NF): Every non-key column depends on the entire primary key. Relevant when you have composite primary keys.
If you have a table order_products(order_id, product_id, product_name, quantity), then product_name depends only on product_id, not on the full composite key (order_id, product_id). This is a partial dependency — violates 2NF. Fix it by extracting products(id, name).
Third Normal Form (3NF): No non-key column depends on another non-key column (no transitive dependencies).
-- Violates 3NF: zip_code determines city and state
orders(id, zip_code, city, state, total)
-- Satisfies 3NF: city/state stored once, referenced by zip
zip_codes(zip_code PRIMARY KEY, city, state)
orders(id, zip_code REFERENCES zip_codes, total)
In practice, strictly enforcing 3NF for things like zip codes is often overkill — you'd need to maintain a reference table for every derived value. Use judgment. The principle matters: avoid storing the same fact in multiple places.
The cost of denormalization
Update anomalies: A customer changes their address. If the address is stored in every order row, you need to update thousands of rows atomically. Miss one, and your data is inconsistent.
Redundant storage: Customer name and email stored in every invoice row. If you have 1 million invoices for 10,000 customers, you're storing each customer's name 100 times on average. This inflates table size, bloats backups, and wastes buffer pool space.
Query complexity: "Find all customers who changed their email since their first order" becomes a self-join or a subquery nightmare when emails are stored per-order. It's a trivial query when emails live in a customers table with an audit log.
When denormalization is actually correct
Normalization is not an absolute rule. Denormalization is sometimes the right choice, but it should be a deliberate architectural decision, not a default.
Read-heavy reporting tables: A data warehouse fact table storing pre-computed denormalized records for analytics queries is appropriate. You're trading write overhead (ETL processes) for read simplicity (no joins in analytics queries).
Performance-critical hot paths: If a query runs 100,000 times per second and the normalized version requires three joins, adding a redundant denormalized column with explicit application-level update logic may be warranted. Document this explicitly — it's technical debt with a known cost.
Event sourcing and audit logs: Immutable event records should store the state at the time of the event, not references to mutable entities. An order event should record the product name and price at purchase time, not reference the current product record.
The normalized invoicing schema
CREATE TABLE customers (
id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
name VARCHAR(255) NOT NULL,
email VARCHAR(255) NOT NULL UNIQUE,
address TEXT
);
CREATE TABLE invoices (
id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
customer_id BIGINT NOT NULL REFERENCES customers(id),
invoice_number VARCHAR(50) NOT NULL UNIQUE,
invoice_date DATE NOT NULL,
due_date DATE NOT NULL,
status VARCHAR(50) NOT NULL DEFAULT 'draft'
);
CREATE TABLE invoice_items (
id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
invoice_id BIGINT NOT NULL REFERENCES invoices(id),
description TEXT NOT NULL,
quantity INT NOT NULL CHECK (quantity > 0),
unit_price DECIMAL(10,2) NOT NULL CHECK (unit_price >= 0)
);
Now updating a customer email is one row. Querying "all invoices with an item over $100" is a simple join with an indexed filter. And the schema enforces the business rule that quantities must be positive, at the database level, not just in application code.
The practical standard
Apply normalization to 3NF as your baseline. Denormalize explicitly and document why when you do. If you find yourself writing application code to keep redundant columns in sync, that's a strong signal you've denormalized without the architectural justification to support it.