System Design Is Not About Drawing Pretty Diagrams

by Eric Hanson, Backend Developer at Clean Systems Consulting

The Diagram Is a Lie

You have been in that meeting. Someone opens a whiteboard tool, draws boxes connected by arrows, labels them "API Gateway," "Service A," "Service B," "Database," and calls it a design. Everyone nods. The diagram looks clean. Nothing in it tells you what happens when Service B goes down, what the write throughput ceiling is, or whether you have chosen the right consistency model for the problem you are actually solving.

System design is a series of decisions under uncertainty. The diagram is just a way to communicate some of those decisions — after you have made them. Treating diagram production as the design process itself is how teams end up with architectures that are aesthetically coherent and operationally broken.

What Design Actually Is

A system design is a set of answers to hard questions. Before any box gets drawn:

  • What is the read-to-write ratio, and does it change under load?
  • What is the acceptable data loss window — seconds, minutes, zero?
  • What is the latency requirement at p99, not just average?
  • Which failure modes are tolerable and which are not?
  • What does the team have operational experience running?
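The p99 question is on that list because an average can hide the requests that hurt. A minimal sketch with made-up latency samples (the `percentile` helper and the numbers are illustrative assumptions, not measurements from any real system):

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding float rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [20] * 98 + [2000] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} ms, p50={p50_ms} ms, p99={p99_ms} ms")
```

In this sample the mean sits under 60 ms while the p99 is two full seconds, so a requirement written only in terms of the average would pass a system that regularly stalls.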

None of these show up in the diagram. A box labeled "Cache Layer" does not tell you whether you chose Redis with read replicas, a local in-process cache, or a CDN edge cache — and those three choices have completely different operational characteristics, invalidation behaviors, and failure modes.

The decisions are the design. The diagram annotates the decisions for people who weren't in the room.

Where Real Design Happens

Real design happens when you sit with constraints. Consider a system handling financial transactions with these requirements: 10,000 writes per second, strict ordering per account, five-nines availability, auditable history.
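It helps to turn those requirements into numbers before drawing anything. A rough calculation, assuming a 365-day year and assuming every second of unavailability counts against the budget (the function name is illustrative):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # ignores leap years


def downtime_budget_seconds(availability: float) -> float:
    """Seconds per year the system may be unavailable at the given availability."""
    return (1.0 - availability) * SECONDS_PER_YEAR


budget_seconds = downtime_budget_seconds(0.99999)  # five nines
writes_per_second = 10_000                         # from the stated requirement

# Writes that would arrive while the system is down, if the whole budget were spent.
writes_during_downtime = writes_per_second * budget_seconds

print(f"budget: {budget_seconds:.0f} s/year (~{budget_seconds / 60:.1f} min)")
print(f"writes arriving during that window: ~{writes_during_downtime:,.0f}")
```

Five nines leaves a little over five minutes of downtime per year, and at the stated write rate roughly three million writes arrive in that window. That number is what forces decisions about buffering, failover, and replay.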

A diagram might show: Client → API → Queue → Worker → DB. That diagram is compatible with dozens of different implementations. The design decisions narrow it:

  • The queue is Kafka, keyed on account ID so that each account's events land in a single partition, which gives per-account ordering without a global bottleneck
  • The database must support serializable isolation; otherwise the worker uses optimistic locking with retry logic
  • The worker must be idempotent because Kafka's default delivery guarantee is at-least-once, not exactly-once
  • The audit log is an append-only event store, not a mutable record with updated_at timestamps
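
# Kafka partitioning for per-account ordering
# (num.partitions is the broker default for newly created topics)
num.partitions=128
# partition key = account_id keeps each account's events in order
# within one partition; 128 partitions give parallelism across accounts

-- Worker idempotency check (PostgreSQL syntax): a redelivered
-- message with the same id inserts nothing
INSERT INTO transactions (id, account_id, amount, created_at)
VALUES ($1, $2, $3, $4)
ON CONFLICT (id) DO NOTHING;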
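The two snippets above can be exercised together in a small sketch. It makes several assumptions: CRC32 stands in for the hash (Kafka's default partitioner uses murmur2), an in-memory SQLite database stands in for the real one, and the names `partition_for` and `apply_transaction` are hypothetical.

```python
import sqlite3
import zlib

NUM_PARTITIONS = 128  # same partition count as the config above


def partition_for(account_id: str) -> int:
    """Map an account to a partition; the same key always gets the same partition."""
    # Stand-in hash for illustration only; Kafka's default partitioner uses murmur2.
    return zlib.crc32(account_id.encode("utf-8")) % NUM_PARTITIONS


def apply_transaction(conn: sqlite3.Connection, txn: dict) -> bool:
    """Insert the transaction once; return False if this id was already applied."""
    cur = conn.execute(
        "INSERT INTO transactions (id, account_id, amount, created_at) "
        "VALUES (?, ?, ?, ?) ON CONFLICT (id) DO NOTHING",
        (txn["id"], txn["account_id"], txn["amount"], txn["created_at"]),
    )
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "id TEXT PRIMARY KEY, account_id TEXT NOT NULL, "
    "amount INTEGER NOT NULL, created_at TEXT NOT NULL)"
)

txn = {"id": "txn-1", "account_id": "acct-42", "amount": 1500,
       "created_at": "2024-01-01T00:00:00Z"}

first = apply_transaction(conn, txn)   # first delivery: row inserted
second = apply_transaction(conn, txn)  # redelivery: nothing inserted
row_count = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
print(partition_for("acct-42"), first, second, row_count)
```

The worker can process the same message any number of times and the table still holds one row, which is what makes at-least-once delivery safe here.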

None of that is visible in the pretty diagram. All of it matters for whether the system works.
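The optimistic-locking alternative from the list is also a concrete decision with code behind it. A minimal sketch, assuming a hypothetical `accounts` table with a `version` column and using in-memory SQLite in place of the production database:

```python
import sqlite3


class ConflictError(Exception):
    """Raised when the update keeps losing the race after all retries."""


def debit(conn: sqlite3.Connection, account_id: str, amount: int, max_retries: int = 3) -> int:
    """Subtract amount from the balance using a version check instead of a lock."""
    for _ in range(max_retries):
        balance, version = conn.execute(
            "SELECT balance, version FROM accounts WHERE id = ?", (account_id,)
        ).fetchone()
        if balance < amount:
            raise ValueError("insufficient funds")
        # The UPDATE only matches if nobody changed the row since we read it.
        cur = conn.execute(
            "UPDATE accounts SET balance = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (balance - amount, account_id, version),
        )
        conn.commit()
        if cur.rowcount == 1:
            return balance - amount
        # Zero rows updated: another writer bumped the version, so re-read and retry.
    raise ConflictError(f"gave up on {account_id} after {max_retries} attempts")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL, version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO accounts VALUES ('acct-42', 100, 0)")
conn.commit()

new_balance = debit(conn, "acct-42", 30)
stored = conn.execute("SELECT balance, version FROM accounts WHERE id = 'acct-42'").fetchone()
print(new_balance, stored)
```

The retry count and what happens when retries run out are themselves design decisions that belong in writing.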

The Diagram Test

Here is a useful heuristic: if you can swap out the label on any box in your diagram without changing anything else in the diagram, that box is not designed — it is wished for.

"Cache" is not a design decision. "Redis 7 with a 60-second TTL on product catalog reads, write-through on mutations, no caching on user-specific data" is a design decision. The former fits in a box. The latter fits in a document that explains the tradeoff you made and why you made it.
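To show how much behavior that one sentence pins down, here is a sketch of the TTL and write-through parts. It is an in-process stand-in, not Redis: a plain dict plays the database, and the class and parameter names are hypothetical.

```python
import time


class WriteThroughCache:
    """Read-through on miss, write-through on mutation, fixed TTL per entry."""

    def __init__(self, store: dict, ttl_seconds: float = 60.0, clock=time.monotonic):
        self._store = store        # dict standing in for the source of truth
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested
        self._entries = {}         # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]        # fresh hit
        value = self._store[key]   # miss or expired: read the source of truth
        self._entries[key] = (value, now + self._ttl)
        return value

    def put(self, key, value):
        # Write-through: update the store and the cache in the same call.
        self._store[key] = value
        self._entries[key] = (value, self._clock() + self._ttl)


catalog = {"sku-1": "Widget"}
cache = WriteThroughCache(catalog, ttl_seconds=60.0)
cache.put("sku-1", "Widget v2")
print(cache.get("sku-1"), catalog["sku-1"])
```

Even this toy version exposes the tradeoff: a change made directly to the store, bypassing `put`, stays invisible to readers for up to 60 seconds.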

The same applies to every component. "Message Queue" is not a decision. The choice between SQS with FIFO queues versus Kafka versus RabbitMQ with quorum queues involves latency characteristics, durability guarantees, consumer group semantics, and ordering behavior — all of which depend on your specific workload.

What Good Design Documentation Looks Like

The most useful design artifact is not the diagram. It is the Architecture Decision Record (ADR) — a short document capturing the context, the decision, the alternatives considered, and the consequences including the downsides.

# ADR-012: Use Kafka for order event streaming

## Context
Order events must be processed by three downstream consumers
(inventory, billing, analytics) with different throughput requirements.
We need replay capability for backfill and at-least-once delivery.

## Decision
Kafka with consumer groups per downstream service.

## Alternatives Considered
- SQS fan-out via SNS: no replay capability, higher cost at volume
- RabbitMQ: adequate for current load but no built-in replay

## Consequences
- Operational complexity: we must run and monitor the brokers plus ZooKeeper or KRaft controllers
- Consumers must be idempotent — Kafka does not guarantee exactly-once
  without transactions enabled (adds latency we cannot absorb)
- Retention period set to 7 days; backfill beyond that requires S3 offload

That document is worth more than any diagram. It tells the next engineer why the system is the way it is and what you gave up to get there.

The Practical Shift

Stop opening design sessions with "What does the architecture look like?" Start with "What are the hardest constraints we are working within?" Write those down first. Let the diagram emerge from the answers, not the other way around.

Draw the diagram last. Write the decisions first. That is the sequence that produces systems that survive contact with production.
