Single Points of Failure Are Hiding in Your System Right Now

by Eric Hanson, Backend Developer at Clean Systems Consulting

The Obvious Ones and the Hidden Ones

Every engineer knows to look for obvious single points of failure: a single database primary, a single application server, a load balancer with no standby. These are easy to spot in an architecture diagram and easy to remediate with redundancy.

The failures that actually cause incidents are the ones that do not appear in the architecture diagram.

Your deployment pipeline. If all deployments run through a single CI/CD server or pipeline with no redundancy, a failure during deployment leaves you with code that cannot be shipped. If a critical fix needs to go out during that failure, you have a problem. Managed CI/CD (GitHub Actions, CircleCI) with redundancy built in is standard. Self-hosted Jenkins with no HA configuration is a hidden SPOF.

Your DNS provider. A misconfiguration or provider outage at your DNS registrar or DNS hosting provider makes your entire domain unreachable. All your application redundancy becomes irrelevant if clients cannot resolve your domain. Using a DNS provider with high availability (Cloudflare DNS, Route 53) and maintaining secondary DNS or NS redundancy is a standard mitigation. Keeping all domains at a single registrar without 2FA on the registrar account is a hidden SPOF.

Configuration and secrets management. If your application fetches configuration or secrets from a single service at startup — a self-hosted Vault instance, an EC2 instance running a config server — that service becomes a SPOF for all deployments and restarts. Instances cannot start without it, so if it fails, your auto-scaling group cannot replace unhealthy instances.
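
One common mitigation is to cache the last-known-good configuration locally so instances can still boot when the config service is down. A minimal sketch, assuming a JSON config payload; `fetch_remote_config` stands in for whatever client call your config service actually uses:

```python
import json
import os
import tempfile

# Hypothetical cache location; a real deployment would use a persistent path.
CACHE_PATH = os.path.join(tempfile.gettempdir(), "app_config_cache.json")

def fetch_remote_config() -> dict:
    # Placeholder for the real call to your config service (Vault, config server, ...).
    raise ConnectionError("config service unreachable")

def load_config() -> dict:
    """Prefer the remote config service; fall back to the last-known-good cache."""
    try:
        config = fetch_remote_config()
        with open(CACHE_PATH, "w") as f:
            json.dump(config, f)  # refresh the cache on every successful fetch
        return config
    except (ConnectionError, OSError):
        try:
            with open(CACHE_PATH) as f:  # last-known-good fallback
                return json.load(f)
        except FileNotFoundError:
            raise RuntimeError("no config available: remote down and no local cache")
```

The cache turns a hard startup dependency into a soft one: a config-service outage degrades freshness instead of blocking instance replacement.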

Your CDN or TLS certificate. A certificate expiration or a CDN misconfiguration can take down HTTPS access to your entire service in minutes. Automated certificate renewal (Let's Encrypt with certbot, AWS Certificate Manager) and CDN configuration in version-controlled infrastructure code reduce this risk.
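
Even with automated renewal, an independent expiry check is cheap insurance against renewal silently failing. A small sketch using only the standard library; the threshold and host are illustrative:

```python
import socket
import ssl
import time

def days_until_expiry(not_after: str) -> float:
    """Days remaining, given a certificate's notAfter field in the format
    returned by ssl.getpeercert(), e.g. 'Jun  1 12:00:00 2030 GMT'."""
    expiry = ssl.cert_time_to_seconds(not_after)
    return (expiry - time.time()) / 86400

def check_host(hostname: str, port: int = 443) -> float:
    """Connect over TLS and report days until the served certificate expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    return days_until_expiry(cert["notAfter"])

# Example: alert if fewer than 14 days remain.
# if check_host("example.com") < 14: page_the_oncall()
```

Run it from a scheduler outside the renewal pipeline, so the check does not share a failure mode with the thing it monitors.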

The Non-Infrastructure SPOFs

Implicit dependency on a single team member. The engineer who built the payment integration is the only one who understands it. They go on vacation. Payment processing has an issue. This is an organizational SPOF. It is not visible in any infrastructure diagram.

A single large database transaction that locks tables. A background job that runs a long-running UPDATE on a large table without proper batching takes a table lock, stalling concurrent reads and writes. The job was infrequent enough that nobody noticed, until it ran on the first day of the month when traffic was highest. Not a hardware SPOF, but functionally equivalent to one.
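
The fix is to commit in small batches so each transaction holds locks only briefly. A sketch using SQLite for illustration; the `orders`/`archived` schema is hypothetical, and production databases need their batch size tuned to lock behavior and load:

```python
import sqlite3

def batched_update(conn: sqlite3.Connection, batch_size: int = 1000) -> None:
    """Apply a bulk UPDATE in small committed batches, instead of one
    long-running UPDATE that stalls concurrent readers and writers."""
    while True:
        cur = conn.execute(
            """UPDATE orders SET archived = 1
               WHERE id IN (SELECT id FROM orders
                            WHERE archived = 0 LIMIT ?)""",
            (batch_size,),
        )
        conn.commit()           # release locks between batches
        if cur.rowcount == 0:   # nothing left to update: done
            break
```

Each iteration is a short transaction, so other queries interleave between batches rather than queueing behind one giant lock.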

Centralized session storage without HA. If sessions are stored in a single Redis instance with no replication, a Redis failure logs out all active users simultaneously. Redis Sentinel or Redis Cluster provides HA.

The Audit Process

A useful exercise: for each critical user flow (checkout, authentication, data submission), trace the complete dependency chain and ask: "what single component failure stops this flow from working?"

# Example: checkout flow dependency trace

User -> Load Balancer -> App Server -> Session (Redis) -> Database (Primary)
                                    -> Payment API (external)
                                    -> Email Service (external)
                                    -> Fraud Check (internal service)

SPOFs identified:
- Session Redis: single instance -> add Redis Sentinel
- Database Primary: single AZ -> enable RDS Multi-AZ
- Payment API: no circuit breaker -> add circuit breaker + fallback message
- Fraud Check: synchronous in checkout -> evaluate async post-checkout
- Email Service: if checkout fails when email fails -> move to async queue
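
The circuit breaker noted for the payment API can be sketched in a few lines. This is a minimal, assumption-laden version (thresholds and the "half-open" retry window are illustrative, and real deployments usually reach for a maintained library):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `max_failures` consecutive failures,
    short-circuit calls for `reset_after` seconds, then allow a trial call."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Fail fast instead of waiting on a dependency presumed down.
                raise RuntimeError("circuit open: dependency presumed down")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success resets the failure counter
        return result
```

Wrapped around the payment API client, this converts a hung external dependency into an immediate, handleable error, which the checkout flow can pair with the fallback message from the trace above.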

Do this for your top three critical flows. You will find at least one SPOF per flow that is not on your architecture diagram. Fix the highest-impact ones before your next incident.
