What Happens to Your System When the Queue Backs Up

by Arif Ikhsanudin, Backend Developer

The Queue Is Growing and Not Recovering

Your processing pipeline uses SQS with 10 worker instances. Normally, messages arrive at 200/min and workers process at 500/min, which is comfortable headroom. During a traffic spike, message arrival peaks at 1,200/min and averages around 700/min over the 20 minutes the spike lasts. Workers are running at capacity, so queue depth climbs by about 200/min. By the time the spike passes, queue depth is at 4,000 messages.

The question is what happens next. If workers process at 500/min and new messages arrive at 200/min, the net drain rate is 300/min. A 4,000-message backlog drains in about 13 minutes. That is fine.

Now consider a variation: the spike fills the queue with expensive messages, ones that each take 3 seconds rather than the normal 0.5 seconds. Each message is about six times slower to handle, so processing throughput drops to roughly 100/min. New messages still arrive at 200/min. Queue depth grows by 100/min for as long as those conditions hold. This is a backlog, not a temporary buffer. It does not self-resolve.
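The two scenarios above come down to one subtraction. As a minimal sketch (the function name `drain_minutes` is made up for illustration), the drain time is the backlog divided by the net drain rate, and a net rate of zero or less means the queue never empties:

```python
def drain_minutes(backlog, process_per_min, arrive_per_min):
    """Minutes until the queue is empty, or None if it never drains."""
    net_drain = process_per_min - arrive_per_min
    if net_drain <= 0:
        # Arrivals match or exceed processing: the backlog only grows.
        return None
    return backlog / net_drain


# Normal processing speed: 4,000 messages at a net 300/min.
print(round(drain_minutes(4000, 500, 200), 1))  # 13.3

# Expensive messages: 100/min processed against 200/min arriving.
print(drain_minutes(4000, 100, 200))  # None
```

Running this check against live arrival and processing rates is a cheap way to tell a temporary buffer from a backlog that needs intervention.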

The Cascade Inside a Backed-Up Queue

A backed-up queue produces second-order effects beyond the obvious delay:

Message age and time-sensitivity. Messages that assume immediacy (notifications, time-sensitive alerts, real-time data updates) become incorrect when delayed by hours. A "your order is being processed" notification that arrives 3 hours after the order is confusing. A fraud alert that fires 6 hours after the transaction is useless. If messages are time-sensitive, they need an expiry. SQS has no per-message TTL: the retention period is set per queue (60 seconds to 14 days), and the visibility timeout only controls how long a received message stays hidden from other consumers. For finer-grained expiry, the consumer has to compare each message's send time against a maximum age. Messages beyond their useful window should be discarded or routed to a dead-letter queue, not processed stale.
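One way to drop messages that have outlived their useful window is a check at the top of the consumer. The sketch below assumes the message is a dict shaped like the entries boto3 returns from `receive_message`, and that the consumer asked for the `SentTimestamp` system attribute (a string of epoch milliseconds). The function name and the 5-minute limit are illustrative choices, not values from this article.

```python
import time


def is_stale(message, max_age_seconds, now_ms=None):
    """Return True if the message was sent more than max_age_seconds ago."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # SQS reports SentTimestamp as a string of milliseconds since the epoch.
    sent_ms = int(message["Attributes"]["SentTimestamp"])
    return (now_ms - sent_ms) > max_age_seconds * 1000


# Example: a fraud alert sent 6 hours ago against a 5-minute usefulness window.
six_hours_ms = 6 * 60 * 60 * 1000
now = 1_700_000_000_000
alert = {"Body": "fraud-alert", "Attributes": {"SentTimestamp": str(now - six_hours_ms)}}
print(is_stale(alert, max_age_seconds=300, now_ms=now))  # True
```

A stale message would then be deleted or forwarded to a separate queue for inspection instead of being processed.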

Memory and resource exhaustion. Workers holding large numbers of in-flight messages (messages that have been received but not yet acknowledged) pay a memory cost for each one. If processing each message allocates significant heap space, a large in-flight set causes memory pressure. This can trigger long GC pauses, OOM errors, or worker crashes, each of which reduces processing capacity and worsens the backlog.
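A simple guard against this is to cap how many messages a worker holds at once and only request what it has room for. This is a sketch under assumptions: `receive_budget` and the `max_in_flight` value are hypothetical, while the batch ceiling of 10 is the SQS limit on messages returned by a single receive call.

```python
SQS_MAX_BATCH = 10  # a single SQS receive call returns at most 10 messages


def receive_budget(in_flight, max_in_flight):
    """How many messages this worker may request right now (0 means wait)."""
    free_slots = max(0, max_in_flight - in_flight)
    return min(SQS_MAX_BATCH, free_slots)


print(receive_budget(in_flight=0, max_in_flight=25))   # 10
print(receive_budget(in_flight=20, max_in_flight=25))  # 5
print(receive_budget(in_flight=30, max_in_flight=25))  # 0
```

When the budget is 0 the worker should skip the receive call entirely and finish what it already holds, so memory use stays bounded no matter how deep the queue gets.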

Dead-letter queue accumulation. Messages that fail repeatedly and exhaust their retry count route to a dead-letter queue. A backed-up queue under load means more processing failures (timeouts, dependencies under stress), which means more DLQ accumulation. Without active monitoring and remediation, the DLQ silently accumulates permanent failures that are invisible to users until someone checks.

# SQS configuration for a backed-up queue scenario:

Queue attributes to set:
  MessageRetentionPeriod:         86400 (24 hours; SQS allows 60 seconds to 14 days)
  VisibilityTimeout:              processing_time * 1.5 (give workers room)
  ReceiveMessageWaitTimeSeconds:  20 (long polling, reduce API calls)
  RedrivePolicy.maxReceiveCount:  3 (receives before moving to DLQ)

Monitoring alerts:
  - Alert when DLQ depth > 0 (every DLQ message is a processing failure)
  - Alert when main queue depth exceeds 10 minutes of normal throughput
  - Alert when the age of the oldest message (ApproximateAgeOfOldestMessage) exceeds a threshold
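The queue settings above can be expressed as the attribute map that the SQS API expects, where every value is a string and the DLQ settings live in a JSON-encoded `RedrivePolicy`. This is a sketch, not a complete setup: the helper name `queue_attributes` and the ARN in the example are placeholders, and the resulting dict would be passed as `Attributes` to a call such as boto3's `create_queue` or `set_queue_attributes`.

```python
import json
import math

MAX_VISIBILITY_SECONDS = 43200  # SQS caps the visibility timeout at 12 hours


def queue_attributes(processing_seconds, dlq_arn):
    """Build SQS queue attributes; all values must be strings."""
    visibility = min(MAX_VISIBILITY_SECONDS, math.ceil(processing_seconds * 1.5))
    return {
        "MessageRetentionPeriod": "86400",      # keep messages for 24 hours
        "VisibilityTimeout": str(visibility),   # 1.5x expected processing time
        "ReceiveMessageWaitTimeSeconds": "20",  # long polling
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": "3",             # receives before moving to DLQ
        }),
    }


# Placeholder ARN for illustration only.
attrs = queue_attributes(3, "arn:aws:sqs:us-east-1:123456789012:example-dlq")
print(attrs["VisibilityTimeout"])  # 5
```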

Designing Against Backlog Accumulation

Autoscale consumers on queue depth, not CPU. CPU utilization is a lagging indicator, and workers that spend their time waiting on a slow dependency may never show high CPU at all. By the time CPU signals a scale-out, the queue is already backing up. CloudWatch alarms on queue depth can trigger Auto Scaling group scale-outs or Lambda concurrency increases sooner.
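A common way to turn queue depth into a worker count is to decide how much backlog one worker may carry and divide. The sketch below assumes each worker handles about 50 messages/min (500/min across 10 workers, as in the opening scenario) and that a 5-minute wait is acceptable; the function name, the latency target, and the min/max bounds are illustrative assumptions.

```python
import math


def desired_workers(queue_depth, per_worker_per_min, max_latency_min,
                    min_workers=2, max_workers=40):
    """Worker count needed so the current backlog clears within max_latency_min."""
    # Backlog one worker can clear inside the latency target.
    acceptable_backlog_per_worker = per_worker_per_min * max_latency_min
    needed = math.ceil(queue_depth / acceptable_backlog_per_worker)
    # Never scale below the floor or above the ceiling.
    return max(min_workers, min(max_workers, needed))


print(desired_workers(4000, per_worker_per_min=50, max_latency_min=5))  # 16
print(desired_workers(0, per_worker_per_min=50, max_latency_min=5))     # 2
```

This only sizes the fleet for the backlog that already exists. Ongoing arrivals need capacity on top of it, so treat the result as a floor rather than a final answer.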

Make processing fast and bounded. Expensive work inside a queue consumer, such as heavy computation or chained external API calls, slows processing and reduces throughput under load. Move expensive work out of the consumer and into a downstream asynchronous stage. A consumer that receives a message, validates it, and enqueues a more specific job for a specialized worker is faster than a consumer that does all the work.

Implement per-message timeouts. A consumer that hangs indefinitely on a slow database call keeps its message hidden from other consumers until the SQS visibility timeout expires, and blocks a worker thread the whole time. Set explicit operation-level timeouts inside consumer code. A consumer that fails fast frees its worker immediately and can release the message for reprocessing right away by resetting its visibility timeout to 0, rather than holding it invisible while hung.
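One way to enforce this is a single time budget per message that every operation draws from, so no individual call can wait longer than what is left. This is a minimal sketch: the `Deadline` class, the `process` function, and the `lookup` and `notify` callables are hypothetical stand-ins for real database and API clients that accept a `timeout` argument.

```python
import time


class Deadline:
    """One time budget shared by every operation for a single message."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self):
        """Seconds left in the budget; raises TimeoutError once it is spent."""
        left = self._expires_at - self._clock()
        if left <= 0:
            raise TimeoutError("message processing budget exhausted")
        return left


def process(message, deadline, lookup, notify):
    """Handle one message, passing the shrinking budget to each call."""
    record = lookup(message["Body"], timeout=deadline.remaining())
    notify(record, timeout=deadline.remaining())
    return record
```

The consumer loop would catch `TimeoutError`, leave the message undeleted, and optionally make it visible again immediately so another worker can retry it.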

Separate queues by message priority and cost. A single queue mixing fast messages and slow messages means slow messages degrade fast message throughput. Separate queues per message type, with separate consumer pools, lets you apply different scaling policies and processing SLAs per message type.
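A thin front consumer can do the separation: validate the job, pick the queue for its type, and hand it off. In this sketch the message types, the queue names, and the `enqueue` callable are all hypothetical; in a real system `enqueue` would wrap the call that sends a message to the chosen queue.

```python
# Hypothetical message types and queue names, for illustration only.
QUEUE_FOR_TYPE = {
    "notification": "fast-jobs",
    "thumbnail": "fast-jobs",
    "report_export": "slow-jobs",
    "video_transcode": "slow-jobs",
}
# Unknown types go to the slow queue so they cannot degrade the fast path.
DEFAULT_QUEUE = "slow-jobs"


def target_queue(message_type):
    """Pick the queue whose consumer pool matches this message's cost."""
    return QUEUE_FOR_TYPE.get(message_type, DEFAULT_QUEUE)


def dispatch(job, enqueue):
    """Validate a job and hand its payload to the queue for its type."""
    if "type" not in job or "payload" not in job:
        raise ValueError("job needs 'type' and 'payload'")
    queue = target_queue(job["type"])
    enqueue(queue, job["payload"])
    return queue


sent = []
dispatch({"type": "notification", "payload": {"user": 1}},
         lambda queue, payload: sent.append((queue, payload)))
print(sent)  # [('fast-jobs', {'user': 1})]
```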

The queue is not a guarantee that work completes. It is a guarantee that work is durably stored until a consumer handles it. The consumer design determines whether the work actually completes correctly and within a useful time window.
