Designing for Failure Is Not Optional in Distributed Systems

by Arif Ikhsanudin, Backend Developer

The Assumption That Breaks Everything

Most systems are designed assuming components are available. The application calls the database, the database responds. The service calls the upstream API, the API returns data. This assumption is correct most of the time. When it is wrong — the database is temporarily unreachable, the upstream API times out — systems designed without explicit failure handling have undefined behavior.

Undefined behavior in production means: unknown recovery time, potential data corruption depending on what was in-flight, and engineers debugging under pressure with no runbook.

In a distributed system, meaning anything that makes network calls, component unavailability is not exceptional. Networks partition. Processes crash. Cloud provider zones have incidents. If a request depends on N components in series, each with 99.9% uptime and failing independently, the combined uptime is 0.999^N. At N=10 components, that is about 99.0%, or roughly 87 hours of combined failure time per year across the system.
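The arithmetic is easy to check directly. The sketch below assumes the components sit in series and fail independently, which is a simplification; the function names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def combined_availability(per_component: float, n: int) -> float:
    """Availability of n components in series, assuming independent failures."""
    return per_component ** n


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours per year in which at least one component is down."""
    return (1.0 - availability) * HOURS_PER_YEAR


availability = combined_availability(0.999, 10)
print(f"combined availability: {availability:.4f}")  # 0.9900
print(f"downtime per year: {downtime_hours_per_year(availability):.1f} hours")  # 87.2
```

Correlated failures (a shared zone, a shared network) break the independence assumption, so treat the result as a rough estimate rather than a guarantee.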

Design for it.

The Failure Modes to Explicitly Handle

Timeouts. Every network call must have a timeout. An external service that stops responding often does not return an error; it hangs. Without a timeout, the calling thread waits with it, holding resources (connections, memory, thread pool slots) until the operating system eventually gives up on the connection, which can take minutes or may never happen. Under load, this exhausts thread pools quickly.

Set timeouts per operation rather than relying on the library default. Defaults vary widely: some clients wait 30 or 60 seconds, and some have no timeout at all. A timeout appropriate for your SLA is typically much shorter.

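import httpx

# Avoid: relying on the library default. httpx defaults to 5 seconds,
# but some clients (for example, requests) have no timeout at all.
response = httpx.get("https://external-api.com/data")

# Prefer: an explicit timeout chosen for this operation.
# httpx.Timeout needs either a default or all four values set,
# so pool (time waiting for a free connection) is included here.
response = httpx.get(
    "https://external-api.com/data",
    timeout=httpx.Timeout(connect=2.0, read=5.0, write=2.0, pool=2.0),
)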

Circuit breakers. A service that repeatedly fails should not receive continued traffic that drains resources and fails users. A circuit breaker wraps calls to downstream services and tracks failure rate. After a threshold (e.g., 50% failure rate over 10 seconds), the circuit opens: subsequent calls fail fast without attempting the network call. After a cool-down period, the circuit half-opens and lets a trial call through to test whether the downstream has recovered.

Resilience4j (JVM), Polly (.NET), and PyBreaker (Python) implement this pattern. The principle: fail fast when the downstream is known-bad, rather than queuing requests that will fail anyway.
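To make the state machine concrete, here is a minimal sketch of a circuit breaker. It is not how any of the libraries above are implemented: it trips on a count of consecutive failures rather than a failure rate over a time window, it is not thread-safe, and the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `reset_timeout` are made up for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func: Callable[[], T]) -> T:
        if self.state == "open":
            # Fail fast: do not attempt the network call at all.
            raise CircuitOpenError("downstream marked unavailable")
        try:
            result = func()
        except Exception:
            self._record_failure()
            raise
        # Success (including a half-open trial call) closes the circuit.
        self._failures = 0
        self._opened_at = None
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failed trial call in the half-open state reopens immediately.
        if self._opened_at is not None or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A caller would wrap each downstream request, for example `breaker.call(lambda: httpx.get(url, timeout=5.0))`, and treat `CircuitOpenError` like any other failure of that dependency. In production, one of the libraries named above is usually the better choice.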

Fallbacks. When a non-critical downstream dependency is unavailable, return a degraded response rather than an error. A product recommendation service that is down should not cause the product page to fail — it should cause the recommendations section to be empty, or served from a cached last-known result. Identifying which downstream dependencies are critical path (page fails without them) versus non-critical (page degrades gracefully) is a design decision that must happen before the incident.
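A fallback for the recommendations case might look like the sketch below. The function name, the in-process cache, and the `fetch` callable are all hypothetical; a real service would more likely use a shared cache with expiry and catch only the exceptions its client actually raises.

```python
from typing import Callable, Dict, List

# Hypothetical in-process cache of the last successful result per product.
_last_known: Dict[str, List[str]] = {}


def recommendations_with_fallback(
    product_id: str,
    fetch: Callable[[str], List[str]],
) -> List[str]:
    """Return fresh recommendations, else the last known result, else an empty list."""
    try:
        items = fetch(product_id)
    except Exception:
        # Non-critical dependency failed: degrade instead of failing the page.
        # Broad catch for brevity; narrow it to your client's error types.
        return _last_known.get(product_id, [])
    _last_known[product_id] = items
    return items
```

The page handler calls this function instead of the recommendation client directly, so an outage shows up as an empty or slightly stale section rather than an error page.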

Retry with exponential backoff and jitter. Transient failures — a momentary network hiccup, a brief timeout — are often recoverable with a retry. Immediate retry on failure can overwhelm a struggling service (retry storm). Exponential backoff with jitter — retry after 1s, then 2s, then 4s, with random jitter to spread retries across time — reduces retry pressure while allowing recovery.
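As a rough sketch of the idea, the helper below retries with a delay ceiling that doubles each attempt and applies "full jitter", one common variant in which each wait is a random fraction of the ceiling. The function name, parameters, and the choice of retryable exceptions are assumptions for illustration.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Ceiling doubles each attempt: 1s, 2s, 4s, ... capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: wait a random fraction of the ceiling.
            sleep(rng() * ceiling)
    raise AssertionError("unreachable")
```

The `sleep` and `rng` parameters exist only so the behaviour can be tested without waiting. Errors outside `retry_on` propagate immediately, since retrying a non-transient failure only adds load.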

The Practices That Prove the Design Works

Chaos testing. Netflix's Chaos Monkey terminates random production instances. The principle is that if you do not regularly test failure handling, you do not know if it works. Start smaller: chaos engineering tools like Gremlin or AWS Fault Injection Simulator let you simulate network latency, dependency unavailability, and instance termination in a controlled way. Test your fallbacks before your on-call engineer is testing them at 2am.
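Before reaching for production tooling, the same idea can be tried at small scale in a test suite. The sketch below is an illustration, not a replacement for the tools above: `with_fault_injection`, `InjectedFault`, and `render_product_page` are hypothetical names, and the page renderer is a stand-in for real handler code.

```python
import random
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


class InjectedFault(ConnectionError):
    """Marks a failure that was injected deliberately, not a real outage."""


def with_fault_injection(
    func: Callable[..., T],
    failure_rate: float,
    rng: Callable[[], float] = random.random,
) -> Callable[..., T]:
    """Wrap func so that a fraction of calls fail before reaching it."""

    def wrapper(*args, **kwargs):
        if rng() < failure_rate:
            raise InjectedFault("injected failure")
        return func(*args, **kwargs)

    return wrapper


def render_product_page(fetch_recommendations: Callable[[], List[str]]) -> Dict[str, object]:
    """Stand-in page handler that degrades when recommendations fail."""
    try:
        recommendations = fetch_recommendations()
    except ConnectionError:
        recommendations = []
    return {"title": "Product", "recommendations": recommendations}


# Simulate a total outage of the dependency and confirm the page still renders.
always_down = with_fault_injection(lambda: ["a", "b"], failure_rate=1.0)
page = render_product_page(always_down)
print(page["recommendations"])  # []
```

A test like this verifies the fallback path exists and works in isolation; it does not replace testing against real network faults in a staging or production environment.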

Runbooks for known failure modes. Every failure mode that has been identified in the design should have a documented recovery procedure. "What do we do if the payment service is down?" should have an answer that is findable in 60 seconds. A runbook is not a sign that the system is fragile — it is a sign that the team has thought about failure.

Design for failure explicitly. Every call that can fail will fail eventually. The question is whether the system has a plan.
