System Design Is Not About Drawing Pretty Diagrams
by Eric Hanson, Backend Developer at Clean Systems Consulting
The Diagram Is a Lie
You have been in that meeting. Someone opens a whiteboard tool, draws boxes connected by arrows, labels them "API Gateway," "Service A," "Service B," "Database," and calls it a design. Everyone nods. The diagram looks clean. Nothing in it tells you what happens when Service B goes down, what the write throughput ceiling is, or whether you have chosen the right consistency model for the problem you are actually solving.
System design is a series of decisions under uncertainty. The diagram is just a way to communicate some of those decisions — after you have made them. Treating diagram production as the design process itself is how teams end up with architectures that are aesthetically coherent and operationally broken.
What Design Actually Is
A system design is a set of answers to hard questions. Before any box gets drawn:
- What is the read-to-write ratio, and does it change under load?
- What is the acceptable data loss window — seconds, minutes, zero?
- What is the latency requirement at p99, not just average?
- Which failure modes are tolerable and which are not?
- What does the team have operational experience running?
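The p99 question deserves to be made concrete. A minimal sketch with a nearest-rank percentile over illustrative numbers (not taken from any real system) shows how an average hides the tail:

```python
# Illustrative only: 98 fast requests and 2 slow outliers. The mean looks
# healthy; the p99 is what your slowest users actually experience.
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

latencies_ms = [20] * 98 + [2000] * 2

mean = sum(latencies_ms) / len(latencies_ms)   # 59.6 ms -- looks fine
p50 = percentile(latencies_ms, 50)             # 20 ms
p99 = percentile(latencies_ms, 99)             # 2000 ms -- the real story
```

A latency requirement stated as "average under 100ms" would pass here while 2% of requests take two full seconds.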
None of these show up in the diagram. A box labeled "Cache Layer" does not tell you whether you chose Redis with read replicas, a local in-process cache, or a CDN edge cache — and those three choices have completely different operational characteristics, invalidation behaviors, and failure modes.
The decisions are the design. The diagram annotates the decisions for people who weren't in the room.
Where Real Design Happens
Real design happens when you sit with constraints. Consider a system handling financial transactions with these requirements: 10,000 writes per second, strict ordering per account, five-nines availability, auditable history.
A diagram might show: Client → API → Queue → Worker → DB. That diagram is compatible with dozens of different implementations. The design decisions narrow it:
- The queue must be Kafka with a partition key on account ID to guarantee per-account ordering without a global bottleneck
- The database must support serializable isolation, or the worker must fall back to optimistic locking with retry logic
- The worker must be idempotent because Kafka's delivery guarantee is at-least-once, not exactly-once
- The audit log is an append-only event store, not a mutable record with updated_at timestamps
# Kafka topic config for per-account ordering
num.partitions=128
# partition key = account_id ensures ordering within an account
# across 128 partitions for parallelism
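The ordering guarantee comes from the partitioner, not from Kafka itself. A rough Python sketch of the key-to-partition mapping (Kafka's default partitioner uses murmur2; md5 here is purely illustrative, since the property that matters is determinism, not the hash):

```python
# Sketch: a deterministic hash of the partition key means every event for one
# account lands on one partition, so that account's events are consumed in
# order, while different accounts spread across all 128 partitions.
import hashlib

NUM_PARTITIONS = 128

def partition_for(account_id: str) -> int:
    digest = hashlib.md5(account_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS
```

The flip side of this design is that one very hot account is capped at the throughput of a single partition; parallelism only helps across accounts, not within one.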
# Worker idempotency check
INSERT INTO transactions (id, account_id, amount, created_at)
VALUES ($1, $2, $3, $4)
ON CONFLICT (id) DO NOTHING;
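To see why that ON CONFLICT clause matters, here is a minimal Python sketch of the worker's behavior under redelivery. A set stands in for the table's primary-key constraint (which the database enforces atomically); the transaction shape is illustrative:

```python
# Sketch: at-least-once delivery means the same message can arrive twice.
# The dedup check makes a replay a no-op instead of a double-applied amount.
processed_ids = set()   # stand-in for the transactions table's primary key
balances = {}           # stand-in for account state

def apply_transaction(txn):
    """Apply a transaction at most once, even if delivered more than once."""
    if txn["id"] in processed_ids:   # duplicate delivery: skip, do not re-apply
        return False
    processed_ids.add(txn["id"])
    balances[txn["account_id"]] = balances.get(txn["account_id"], 0) + txn["amount"]
    return True

txn = {"id": "t1", "account_id": "a1", "amount": 100}
first = apply_transaction(txn)
second = apply_transaction(txn)  # Kafka redelivers after a consumer restart
```

In production the check and the write must be one atomic operation, which is exactly what the INSERT ... ON CONFLICT above gives you; a separate read-then-write has a race between two workers.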
None of that is visible in the pretty diagram. All of it matters for whether the system works.
The Diagram Test
Here is a useful heuristic: if you can swap out the label on any box in your diagram without changing anything else in the diagram, that box is not designed — it is wished for.
"Cache" is not a design decision. "Redis 7 with a 60-second TTL on product catalog reads, write-through on mutations, no caching on user-specific data" is a design decision. The former fits in a box. The latter fits in a document that explains the tradeoff you made and why you made it.
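As a sketch of what that write-through decision implies in code, assuming a dict as a stand-in for Redis and treating the 60-second TTL and helper names as illustrative:

```python
# Sketch: read-through with TTL on catalog reads, write-through on mutations.
# A dict plus (value, expires_at) tuples mimics what Redis SETEX would track.
import time

CATALOG_TTL = 60  # seconds, per the design decision above

cache = {}     # key -> (value, expires_at); stand-in for Redis
database = {}  # stand-in for the system of record

def read_product(product_id):
    entry = cache.get(product_id)
    if entry is not None and time.monotonic() < entry[1]:
        return entry[0]                       # cache hit, possibly up to 60s stale
    value = database.get(product_id)          # miss or expired: read through
    if value is not None:
        cache[product_id] = (value, time.monotonic() + CATALOG_TTL)
    return value

def write_product(product_id, value):
    database[product_id] = value              # write-through: durable store first,
    cache[product_id] = (value, time.monotonic() + CATALOG_TTL)  # then cache
```

Note what the policy encodes: catalog readers tolerate up to 60 seconds of staleness after a concurrent write elsewhere, and user-specific data never enters the cache at all. Those are the tradeoffs the box labeled "Cache" silently hides.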
The same applies to every component. "Message Queue" is not a decision. The choice between SQS with FIFO queues versus Kafka versus RabbitMQ with quorum queues involves latency characteristics, durability guarantees, consumer group semantics, and ordering behavior — all of which depend on your specific workload.
What Good Design Documentation Looks Like
The most useful design artifact is not the diagram. It is the Architecture Decision Record (ADR) — a short document capturing the context, the decision, the alternatives considered, and the consequences including the downsides.
# ADR-012: Use Kafka for order event streaming
## Context
Order events must be processed by three downstream consumers
(inventory, billing, analytics) with different throughput requirements.
We need replay capability for backfill and at-least-once delivery.
## Decision
Kafka with consumer groups per downstream service.
## Alternatives Considered
- SQS fan-out via SNS: no replay capability, higher cost at volume
- RabbitMQ: adequate for current load but no built-in replay
## Consequences
- Operational complexity: we now run the brokers plus ZooKeeper or a KRaft quorum
- Consumers must be idempotent — Kafka does not guarantee exactly-once
without transactions enabled (adds latency we cannot absorb)
- Retention period set to 7 days; backfill beyond that requires S3 offload
That document is worth more than any diagram. It tells the next engineer why the system is the way it is and what you gave up to get there.
The Practical Shift
Stop leading design sessions by asking "what does the architecture look like." Start by asking "what are the hardest constraints we are working within." Write those down first. Let the diagram emerge from the answers, not the other way around.
Draw the diagram last. Write the decisions first. That is the sequence that produces systems that survive contact with production.