What Fault Tolerance Actually Means in a Real Backend System
by Eric Hanson, Backend Developer at Clean Systems Consulting
The Definition That Actually Matters
"The system is fault-tolerant" means nothing operationally useful. Fault-tolerant to what? A database primary failure? A single AZ outage? A network partition between two services? A 10% packet loss on an external API connection? These are different failure scenarios requiring different design responses, and calling a system "fault-tolerant" without specifying the failure mode is a comfort statement, not a design commitment.
A useful definition: fault tolerance is the system's ability to continue operating, possibly in a degraded mode, when a specified set of failure conditions occurs. The specification matters: designing for undefined failure conditions produces undefined behavior.
The Failure Taxonomy
Failures fall into a few categories with different handling strategies:
Single instance failures. An application server crashes. A database connection is dropped. A worker process OOMs and terminates. These are expected, routine events in any distributed system. The design response is redundancy: run multiple instances so no single instance failure takes down the service. Health checks remove failed instances from the pool. Process supervisors (systemd, Kubernetes pod restarts) restart failed processes.
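As a concrete sketch of the health-check piece (a Flask-style service is assumed here; the endpoint names and the dependency check are illustrative, not prescriptive):

# Sketch: liveness and readiness endpoints a load balancer or kubelet can poll
from flask import Flask, jsonify

app = Flask(__name__)

def database_reachable():
    # Hypothetical dependency check; replace with a cheap "SELECT 1" or ping
    return True

@app.route("/healthz")
def healthz():
    # Liveness: the process is up and able to serve requests
    return jsonify(status="ok"), 200

@app.route("/readyz")
def readyz():
    # Readiness: only stay in the pool if dependencies are reachable
    if database_reachable():
        return jsonify(status="ready"), 200
    return jsonify(status="not ready"), 503

The readiness endpoint is what lets the load balancer or kubelet drop a broken instance from rotation while the supervisor restarts it.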
Downstream service failures. An external API is unavailable. An internal microservice is returning 5xx errors. The database replica is unreachable. The design response is isolation: circuit breakers prevent continued calls to failed services, timeouts prevent resource exhaustion, fallback behavior defines what the system does when the downstream is unavailable.
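A minimal circuit breaker, hand-rolled here to show the mechanism (in production you would typically reach for an existing resilience library; the threshold and reset window are illustrative assumptions):

# Sketch: circuit breaker with a fallback; thresholds are illustrative
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds before a trial call is allowed
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        # If the breaker is open and the reset window has not elapsed, skip the call
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = fn()
            self.failures = 0      # a success closes the breaker
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()

The fallback callable is where the degraded behavior lives: a cached response, a stub, or an explicit "try again later" result.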
Data store failures. The primary database is unreachable. A Redis cluster is down. The design response is failover: a replica is promoted to primary (RDS Multi-AZ, Cloud SQL HA), or a fallback data source is used. Recovery time objective (RTO) and recovery point objective (RPO) are determined by replication lag and failover automation.
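On the application side, surviving a managed failover usually comes down to reconnecting with a bounded retry budget while DNS or the proxy flips to the promoted standby. A sketch, assuming connect_fn is any callable that opens a connection; the retry numbers are illustrative:

# Sketch: bounded reconnect loop while a managed failover promotes the standby
import time

def connect_with_failover(connect_fn, attempts=6, delay=5.0):
    last_error = None
    for _ in range(attempts):          # ~30 second budget, roughly one failover window
        try:
            return connect_fn()
        except Exception as exc:       # connection refused, or DNS not yet flipped
            last_error = exc
            time.sleep(delay)
    raise last_error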
Partial failures (gray failures). The service is nominally up but misbehaving: 30% of requests time out, 5% return errors. This is harder to handle than a clean failure. Circuit breakers may not trip because error rates stay below the threshold, and users see degraded service rather than a complete outage. The design response is fine-grained monitoring with p99 latency alerts rather than error rate alerts alone, plus aggressive timeout policies that make a partially available dependency fail fast.
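One concrete lever for gray failures is a tight per-request deadline, which turns slow partial availability into a fast, explicit error the caller can handle. A sketch using the requests library; the URL and the 800 ms budget are assumptions for illustration:

# Sketch: fail fast on a slow dependency instead of tying up worker capacity
import requests

def fetch_profile(user_id, timeout_s=0.8):
    try:
        resp = requests.get(
            f"https://profile.internal/v1/users/{user_id}",  # hypothetical internal URL
            timeout=timeout_s,   # applied to connect and read; tune to the dependency's p99
        )
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        return None              # caller decides what the degraded behavior is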
Fault Tolerance Requires Specifying the SLA
Before implementing fault tolerance mechanisms, answer: what failure scenarios must the system tolerate, and what behavior is acceptable during those failures?
# Fault tolerance specification (example):
Failure: Single AZ unavailability (full AZ down)
Required behavior: System continues operating
Acceptable degradation: Elevated latency during failover (< 60 seconds)
Design: Multi-AZ deployment, RDS Multi-AZ standby, cross-AZ load balancing

Failure: Payment service unavailable
Required behavior: Checkout page degrades gracefully
Acceptable degradation: Card payments disabled, show "try again" messaging
Design: Circuit breaker on payment service, fallback checkout flow

Failure: Cache unavailable (Redis down)
Required behavior: Application continues, higher latency
Acceptable degradation: p95 latency increases from 80ms to 400ms
Design: Cache-aside pattern with database fallback, no cache required for correctness

Failure: Third-party analytics service unavailable
Required behavior: No user impact
Acceptable degradation: Analytics data gap for outage duration
Design: Fire-and-forget async dispatch, no retry, no impact on request path
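The cache scenario above only works if the cache is an optimization, never the source of truth. A sketch of that cache-aside read path with the database as fallback (redis-py is assumed; the key format and TTL are illustrative):

# Sketch: cache-aside read; a Redis outage degrades latency, not correctness
import json
import redis

cache = redis.Redis(host="localhost", port=6379)

def get_user(user_id, load_from_db):
    key = f"user:{user_id}"
    try:
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
    except redis.exceptions.RedisError:
        pass                                       # cache down: fall through to the database
    user = load_from_db(user_id)                   # the database remains the source of truth
    try:
        cache.setex(key, 300, json.dumps(user))    # best-effort repopulate, 5 minute TTL
    except redis.exceptions.RedisError:
        pass
    return user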
What It Costs
Fault tolerance is not free. Each failure mode handled adds complexity. A circuit breaker requires configuration and testing. Multi-AZ deployment roughly doubles compute costs. Fallback behavior requires maintaining two code paths. The question is whether the cost of the fault tolerance mechanism is proportionate to the cost of the failure it prevents.
Classify your failure scenarios by probability and impact. Handle high-probability, high-impact failures (single AZ outage, cache unavailability) with robust automated mechanisms. Handle low-probability, high-impact failures (full region outage) with documented runbooks and acceptable RTO. Accept low-probability, low-impact failures gracefully without complex automation.
Fault tolerance proportionate to actual risk is good engineering. Fault tolerance applied uniformly to all failure modes regardless of probability is over-engineering.