Why Message Queues Change the Way You Think About System Design
by Eric Hanson, Backend Developer at Clean Systems Consulting
The Synchronous Default
Most systems are built synchronously by default. A request comes in, the application performs all the work — database writes, email sending, third-party API calls — and returns a response when everything is done. This model is simple and easy to reason about. It is also fragile: if any step in the chain fails, the whole request fails. If any step is slow, the whole request is slow.
Adding a message queue changes the model fundamentally. Instead of "do the work now, in the request," the model becomes "record the intent now, do the work later." The request returns immediately. A consumer processes the work asynchronously. The two operations are decoupled in time.
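The "record the intent now, do the work later" split can be sketched in a few lines. This is a minimal in-process stand-in, assuming `queue.Queue` plays the role of a real broker (SQS, RabbitMQ, etc.); the function names and the message shape are illustrative, not any particular library's API:

```python
import queue
import threading

# Hypothetical in-process stand-in for a real broker.
work_queue = queue.Queue()

def handle_request(user_email):
    """Synchronous part: record the intent, return immediately."""
    work_queue.put({"action": "send_welcome_email", "to": user_email})
    return {"status": "accepted"}   # the user gets a fast response

def consumer():
    """Asynchronous part: do the work later, at its own pace."""
    while True:
        message = work_queue.get()
        if message is None:          # shutdown sentinel
            break
        # ... actually send the email here ...
        work_queue.task_done()

worker = threading.Thread(target=consumer)
worker.start()

response = handle_request("alice@example.com")
print(response)                      # -> {'status': 'accepted'}

work_queue.put(None)                 # tell the worker to stop
worker.join()
```

The request path touches only the enqueue; everything slow or failure-prone lives on the consumer side.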
This decoupling has consequences that go beyond performance.
What Decoupling Actually Gives You
Producer-consumer independence. When the service responsible for sending emails is down, a synchronous system returns errors to users trying to trigger email-sending actions. A queue-based system accepts the action, persists the message to the queue, and sends the email when the email service recovers. The user experience is preserved. The email is just delayed.
This is not magic — it requires that delayed processing is acceptable for the operation. For a welcome email, a 10-minute delay is fine. For a two-factor authentication code, it is not. The real design decision is identifying which operations can tolerate asynchronous processing.

Backpressure and rate limiting. A consumer pulls messages at a rate it can handle. If the producer generates messages faster than the consumer can process them, messages queue up. The queue absorbs the burst. The consumer processes at a steady rate. Without a queue, the producer would overwhelm the consumer, causing failures or cascading latency.
# Without queue: producer directly calls consumer
# Under spike: producer generates 500 req/s, consumer handles 100 req/s
# Result: consumer is overwhelmed, failures cascade to producer
# With queue: producer enqueues at 500 req/s, consumer dequeues at 100 req/s
# Result: queue depth grows during spike, consumer catches up afterward
# Producer never sees consumer failures
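The arithmetic in that sketch can be checked directly. A tiny simulation, using the same assumed rates (500 msg/s in, 100 msg/s out, a 10-second spike):

```python
# Simulate the spike: producer enqueues at 500 msg/s for 10 s, then stops;
# the consumer drains at a steady 100 msg/s throughout.
produce_rate, consume_rate = 500, 100   # messages per second
spike_seconds = 10

depth = 0
for _ in range(spike_seconds):
    depth += produce_rate - consume_rate   # the queue absorbs the excess
print(depth)          # depth after the spike: (500 - 100) * 10 = 4000

# After the spike the producer is idle; the consumer catches up.
drain_seconds = depth / consume_rate
print(drain_seconds)  # 4000 / 100 = 40 seconds to clear the backlog
```

The queue trades a failure mode (cascading errors) for a latency mode (a 40-second backlog), which is usually the better deal for background work.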
Retry and durability. A message in a durable queue (SQS, RabbitMQ with persistence, Kafka) is not lost if the consumer crashes mid-processing. The message is redelivered to another consumer. This is at-least-once delivery: the consumer must be idempotent — processing the same message twice must be safe — but the operation will eventually complete.
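Idempotency under redelivery usually comes down to remembering what has already been processed. A minimal sketch, assuming each message carries a unique `id` and using an in-memory set where production code would use a database table or upsert:

```python
processed_ids = set()   # in production: a database table, not process memory

def process_message(message):
    """Idempotent consumer: safe under the at-least-once delivery
    that durable queues provide."""
    if message["id"] in processed_ids:
        return "skipped"            # duplicate redelivery: do nothing
    # ... perform the side effect (charge card, send email, ...) ...
    processed_ids.add(message["id"])
    return "processed"

msg = {"id": "order-42", "action": "charge_card"}
print(process_message(msg))   # -> processed
print(process_message(msg))   # -> skipped (redelivered after a crash)
```

The side effect and the ID recording should ideally commit in one transaction; otherwise a crash between the two reopens the duplicate window.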
Implementing equivalent retry logic in synchronous systems is harder. You need to track which operations completed, which failed, and retry them correctly — essentially reimplementing queue semantics manually.
Audit and replay. A queue with sufficient retention (Kafka with configurable retention, SQS with up to 14-day retention) provides a history of operations. You can replay events to rebuild state, debug past behavior, or feed new consumers with historical data. This is not possible with synchronous, fire-and-forget calls.
The New Complexity
Queues are not free. They introduce:
- Eventual processing: the user submits an action and it completes later. The UI must communicate this honestly — "your export is being processed, you will receive an email when complete" rather than returning the download immediately.
- Idempotency requirement: consumers must be written to handle duplicate delivery safely. This usually means tracking processed message IDs or using upsert operations rather than insert.
- Consumer management: monitoring queue depth, consumer lag, and dead-letter queues for messages that failed repeated processing.
- Ordering complexity: SQS standard queues do not guarantee ordering (FIFO queues do). Kafka guarantees ordering only within a partition. If processing order matters, the queue choice and partition strategy must reflect that.
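The dead-letter pattern from the consumer-management point can be sketched concretely. This is an illustrative in-process version, assuming a per-message attempt counter; real brokers (SQS redrive policies, RabbitMQ dead-letter exchanges) implement the same idea for you:

```python
import queue

MAX_ATTEMPTS = 3
main_queue, dead_letter_queue = queue.Queue(), queue.Queue()

def consume_once(handler):
    """Pull one message; on failure requeue it, or dead-letter it
    after MAX_ATTEMPTS tries."""
    message = main_queue.get()
    try:
        handler(message)
    except Exception:
        message["attempts"] = message.get("attempts", 0) + 1
        if message["attempts"] >= MAX_ATTEMPTS:
            dead_letter_queue.put(message)   # park it for inspection
        else:
            main_queue.put(message)          # retry later

def always_fails(message):
    raise RuntimeError("downstream is broken")

main_queue.put({"id": "export-7"})
while not main_queue.empty():
    consume_once(always_fails)

print(dead_letter_queue.qsize())  # -> 1: parked after 3 failed attempts
```

Monitoring the dead-letter queue's depth is what turns "messages that failed repeated processing" from silent loss into an alert.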
When to Introduce a Queue
A message queue earns its keep when one or more of these is true:
- the operation is not on the user-facing critical path (background work)
- the downstream system is slower or less reliable than your application
- you need retry logic without writing it manually
- bursts of work need to be absorbed without cascading to the producer
Start with the synchronous version. Introduce a queue when you can clearly name which of these benefits you are getting and why the synchronous version fails to provide it.