Single Points of Failure Are Hiding in Your System Right Now
by Eric Hanson, Backend Developer at Clean Systems Consulting
The Obvious Ones and the Hidden Ones
Every engineer knows to look for obvious single points of failure: a single database primary, a single application server, a load balancer with no standby. These are easy to spot in an architecture diagram and easy to remediate with redundancy.
The failures that actually cause incidents are the ones that do not appear in the architecture diagram.
Your deployment pipeline. If all deployments run through a single CI/CD server or pipeline with no redundancy, a failure of that pipeline leaves you unable to ship code. If a critical fix needs to go out while it is down, you have a problem. Managed CI/CD (GitHub Actions, CircleCI) comes with redundancy built in and is the standard choice. Self-hosted Jenkins with no HA configuration is a hidden SPOF.
Your DNS provider. A misconfiguration or provider outage at your DNS registrar or DNS hosting provider makes your entire domain unreachable. All your application redundancy becomes irrelevant if clients cannot resolve your domain. Using a DNS provider with high availability (Cloudflare DNS, Route 53) and maintaining secondary DNS or NS redundancy is a standard mitigation. Keeping all domains at a single registrar without 2FA on the registrar account is a hidden SPOF.
Configuration and secrets management. If your application fetches configuration or secrets from a single service at startup (a self-hosted Vault instance, an EC2 instance running a config server), that service becomes a SPOF for all deployments and restarts, because instances cannot start without it. When it fails, your auto-scaling group cannot replace unhealthy instances.
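One common way to soften this dependency is to keep the last successfully fetched configuration on local disk and fall back to it when the service is unreachable. The sketch below is a minimal illustration under assumptions: `fetch_remote` is a hypothetical callable standing in for whatever client your config service provides, and the cache path is yours to choose. Caching secrets on disk has its own security trade-offs, so this pattern suits non-sensitive configuration best.

```python
import json
import os
import tempfile
from typing import Callable, Dict


def load_config(fetch_remote: Callable[[], Dict], cache_path: str) -> Dict:
    """Return fresh config if the service answers, else the last cached copy.

    fetch_remote is a hypothetical wrapper around your config service client;
    it is expected to raise an exception when the service is unavailable.
    """
    try:
        config = fetch_remote()
    except Exception:
        # Config service is down: fall back to the last known-good copy.
        # If no cache exists yet, open() raises and startup still fails.
        with open(cache_path, "r", encoding="utf-8") as f:
            return json.load(f)

    # Write to a temp file, then rename, so a crash mid-write cannot
    # leave a truncated cache behind.
    directory = os.path.dirname(cache_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(config, f)
    os.replace(tmp_path, cache_path)
    return config
```

An instance that has started at least once can then restart during a config service outage, at the cost of possibly running with stale values.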
Your CDN or TLS certificate. A certificate expiration or a CDN misconfiguration can take down HTTPS access to your entire service in minutes. Automated certificate renewal (Let's Encrypt with certbot, AWS Certificate Manager) and CDN configuration in version-controlled infrastructure code reduce this risk.
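Automated renewal can still fail silently, so an independent expiry check is a useful backstop. The following is a rough sketch using only the Python standard library; the 14-day threshold and the function names are illustrative assumptions, not a recommendation from any particular tool.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter timestamp.

    not_after uses the format returned by SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2030 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a TLS connection and return the peer certificate's notAfter field.

    With the default context, an already expired certificate fails the
    handshake with an ssl.SSLError, which should itself be treated as an alert.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def needs_alert(host: str, threshold_days: float = 14.0) -> bool:
    """True when the certificate expires within threshold_days (assumed value)."""
    return days_until_expiry(fetch_not_after(host)) < threshold_days
```

Running `needs_alert` for each public hostname from a scheduled job, and paging when it returns `True`, gives you warning before clients start seeing certificate errors.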
The Non-Infrastructure SPOFs
Implicit dependency on a single team member. The engineer who built the payment integration is the only one who understands it. They go on vacation. Payment processing has an issue. This is an organizational SPOF. It is not visible in any infrastructure diagram.
A single large database transaction that locks tables. A background job that runs a long-running UPDATE on a large table without proper batching holds locks on every row it touches until the transaction commits, and on some database engines those locks escalate to the whole table. Concurrent writes stall, and depending on the engine and isolation level reads can stall too. The job is infrequent enough that nobody noticed until it ran on the first day of the month, when traffic was highest. Not a hardware SPOF, but functionally equivalent to one.
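Batching means splitting the one large UPDATE into many short transactions so that locks are released between them. Here is a minimal sketch using SQLite from the Python standard library; the `orders` table, its columns, and the batch size are hypothetical, and the `LIMIT` inside a subquery is SQLite syntax that other engines may spell differently.

```python
import sqlite3


def expire_orders_in_batches(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Mark pending orders as expired, committing after every batch.

    Each short transaction releases its locks at commit, so other
    traffic can interleave between batches. Returns rows updated.
    """
    total = 0
    while True:
        cursor = conn.execute(
            """
            UPDATE orders SET status = 'expired'
            WHERE id IN (
                SELECT id FROM orders
                WHERE status = 'pending'
                ORDER BY id
                LIMIT ?
            )
            """,
            (batch_size,),
        )
        conn.commit()  # end the transaction and release its locks
        if cursor.rowcount == 0:
            return total  # nothing left to update
        total += cursor.rowcount
```

On a busy production database you would likely also pause briefly between batches and make sure the filter column is indexed, so each batch stays cheap.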
Centralized session storage without HA. Sessions stored in a single Redis instance with no replication mean a Redis failure logs out all active users simultaneously. Redis Sentinel or Redis Cluster provides HA.
The Audit Process
A useful exercise: for each critical user flow (checkout, authentication, data submission), trace the complete dependency chain and ask: "What single component failure stops this flow from working?"
# Example: checkout flow dependency trace
User -> Load Balancer -> App Server -> Session (Redis) -> Database (Primary)
                                    -> Payment API (external)
                                    -> Email Service (external)
                                    -> Fraud Check (internal service)
SPOFs identified:
- Session Redis: single instance -> add Redis Sentinel
- Database Primary: single AZ -> enable RDS Multi-AZ
- Payment API: no circuit breaker -> add circuit breaker + fallback message
- Fraud Check: synchronous in checkout -> evaluate async post-checkout
- Email Service: if checkout fails when email fails -> move to async queue
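A trace like this can also be kept as data and checked mechanically. The sketch below is one possible shape, not a prescribed tool: the `Dependency` type is hypothetical, the instance counts are invented for illustration, and treating an external API as a single instance is a simplification. Email is marked non-blocking on the assumption that it has already been moved to an async queue.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Dependency:
    name: str
    instances: int   # independently failing copies (replicas, AZs, providers)
    blocking: bool   # True if the flow fails when this dependency is down


def find_spofs(flow: List[Dependency]) -> List[str]:
    """Names of dependencies that are on the blocking path and unreplicated."""
    return [d.name for d in flow if d.blocking and d.instances < 2]


# Illustrative values only; fill these in from your real deployment.
checkout = [
    Dependency("Load Balancer", instances=2, blocking=True),
    Dependency("App Server", instances=3, blocking=True),
    Dependency("Session (Redis)", instances=1, blocking=True),
    Dependency("Database (Primary)", instances=1, blocking=True),
    Dependency("Payment API", instances=1, blocking=True),
    Dependency("Fraud Check", instances=1, blocking=True),
    Dependency("Email Service", instances=1, blocking=False),
]

for name in find_spofs(checkout):
    print(f"SPOF: {name}")
```

Keeping the list in version control next to the infrastructure code makes it easier to notice when a new blocking dependency is added without redundancy.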
Do this for your top three critical flows. You will find at least one SPOF per flow that is not on your architecture diagram. Fix the highest-impact ones before your next incident.
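For the Payment API item in the example trace, a circuit breaker with a fallback message might look roughly like the sketch below. It is a deliberately small, single-threaded illustration; the class, its thresholds, and the fallback text are assumptions, and a production system would more likely use an established resilience library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stop calling a failing dependency for a cool-down period."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to stay open
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T], fallback: T) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback      # open: fail fast without calling func
            self.opened_at = None    # cool-down over: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            return fallback
        self.failures = 0            # success closes the breaker fully
        return result
```

A checkout handler could wrap its payment call as `breaker.call(charge, "Payments are temporarily unavailable, please try again shortly")`, where `charge` is your own function, so that a payment outage produces a quick, clear message instead of a pile of timed-out requests.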