Monitoring Is Not Optional. It Is How You Know Your App Is Alive.

by Eric Hanson, Backend Developer at Clean Systems Consulting

The Incident You Should Have Caught Yesterday

Your payment service has been returning errors for six hours. Not a total outage — about 3% of requests are failing silently. Affected users are seeing generic error messages. Nobody on the team knows yet because there are no alerts configured on error rate. The first signal is a surge in customer support tickets that morning.

Six hours of 3% payment failures is a significant business event. It could have been detected and resolved in under an hour with a simple error rate alert. Without monitoring, it ran silently until users complained.

This happens on services with no monitoring, and it happens regularly.

The Minimum Viable Monitoring

There's a well-established framework for service monitoring called the Four Golden Signals, from Google's SRE Book. For most services, these four metrics are necessary and sufficient for a minimum monitoring posture:

Latency: How long requests take to process, at p50, p95, and p99. Alert when p99 latency exceeds a threshold that indicates user-visible degradation.

Traffic: Request rate, in requests per second or per minute. Both too high (overload) and too low (potential service failure, traffic routing issue) should alert.

Errors: The rate of failed requests as a percentage of total requests. Even a 1% error rate on a critical service is worth alerting on. A 5% error rate is almost always an incident.

Saturation: How "full" the service's resources are — CPU, memory, thread pool utilization, connection pool utilization. Saturation trending toward limits is a leading indicator before things break.
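In concrete terms, three of the four signals can be derived from a single window of request records. A minimal plain-Java sketch, assuming one minute of per-request latencies and a failure count (the numbers and the nearest-rank percentile method are illustrative, not from any particular tool):

```java
import java.util.Arrays;

public class GoldenSignalsWindow {

    /** Nearest-rank percentile over a sorted array of latencies (ms). */
    static long percentile(long[] sortedMillis, double p) {
        int idx = (int) Math.ceil(p * sortedMillis.length) - 1;
        return sortedMillis[Math.max(idx, 0)];
    }

    public static void main(String[] args) {
        // One minute of requests: latencies in ms, one of them a failure
        long[] latencies = {80, 95, 110, 120, 140, 160, 200, 450, 900, 2300};
        Arrays.sort(latencies);
        int failed = 1;

        System.out.printf("traffic: %.2f req/s%n", latencies.length / 60.0);
        System.out.printf("latency: p50=%dms p95=%dms p99=%dms%n",
                percentile(latencies, 0.50),
                percentile(latencies, 0.95),
                percentile(latencies, 0.99));
        System.out.printf("errors:  %.1f%%%n", 100.0 * failed / latencies.length);
    }
}
```

Saturation is the odd one out: it comes from resource gauges (CPU, memory, pool utilization) rather than from the request stream, which is why it usually needs separate instrumentation.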

These four metrics, with sensible alert thresholds, will catch the majority of production incidents before users report them.

The Alert Quality Problem

Many teams add monitoring without adding useful monitoring. The failure modes:

Alerting on causes rather than symptoms: "CPU is above 80%" is a cause-side internal metric. It may or may not correspond to user-visible impact. "Error rate is above 1%" is a symptom — user-visible impact — and it's almost always worth alerting on regardless of the underlying cause.

Alert fatigue from noisy alerts: Alerts that fire for brief spikes that don't represent real problems train engineers to ignore alerts. If your alert fires five times a week and three of them are false positives, the real incident is the one you ignore because you've been conditioned to assume it's also a false positive.

Missing alerts on gradual degradation: Acute failures (sudden error spike, service down) are usually obvious. Gradual degradation (p99 latency trending up by 20% per week, successful request rate slowly declining) is invisible without trend-based alerting.
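Trend-based alerting can start as something very simple: compare an aggregate across windows and warn when growth crosses a limit. A hedged sketch, assuming weekly p99 values are already available (the class name and the 20% growth threshold are illustrative):

```java
// Warn when p99 latency has grown more than maxGrowth week-over-week,
// e.g. maxGrowth = 0.20 for the "20% per week" case above.
public class TrendAlert {

    static boolean degrading(double lastWeekP99, double thisWeekP99, double maxGrowth) {
        return thisWeekP99 > lastWeekP99 * (1 + maxGrowth);
    }
}
```

Real monitoring systems express this as a query over stored time series rather than application code, but the comparison itself is no more complicated than this.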

The Alerting Principle: Alert on User Impact

The right test for any alert: does this alert firing represent a degradation in the experience of users of this service right now? If yes, it should alert. If no, it should be a metric available for investigation, not an alert.

  • "Database CPU is at 70%" — should not alert, because 70% CPU doesn't by itself represent user impact.
  • "p99 latency exceeds 2,000ms" — should alert, because users are waiting 2 seconds for operations that should be sub-200ms.
  • "Error rate exceeds 1%" — should alert, because 1 in 100 users is seeing a failure.
  • "Thread pool utilization exceeds 90%" — should alert as a warning, because this is a leading indicator that will produce user impact soon.
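The "exceeds 1% for 5 minutes" style of condition is what keeps a threshold alert from firing on brief spikes. A minimal sketch of a sustained-threshold check, assuming one error-rate sample per minute (class name, window size, and threshold are all illustrative):

```java
import java.util.ArrayDeque;

// Fires only when the error rate stays above the threshold for the
// entire window, so a single noisy sample does not page anyone.
public class SustainedErrorRateAlert {
    private final ArrayDeque<Double> window = new ArrayDeque<>();
    private final int windowSize;   // e.g. 5 one-minute samples
    private final double threshold; // e.g. 0.01 == 1%

    public SustainedErrorRateAlert(int windowSize, double threshold) {
        this.windowSize = windowSize;
        this.threshold = threshold;
    }

    /** Feed one per-minute sample; returns true when the alert should fire. */
    public boolean record(double errorRate) {
        window.addLast(errorRate);
        if (window.size() > windowSize) window.removeFirst();
        return window.size() == windowSize
                && window.stream().allMatch(r -> r > threshold);
    }
}
```

Alerting systems like Prometheus express the same idea declaratively (a duration attached to the rule); the point of the sketch is only that "sustained" is part of the alert definition, not something engineers apply by eye.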

What to Instrument

For HTTP services, most of the Four Golden Signals come from the request/response cycle. Frameworks and APM tools (Datadog APM, New Relic, Prometheus with Micrometer) typically make this easy to instrument at the middleware layer without per-endpoint code.
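The middleware-layer idea can be sketched in plain Java. In a real service this role is played by a servlet filter or framework interceptor feeding an APM library, but the shape is the same: every request passes through one wrapper that records latency and outcome, so individual endpoints carry no instrumentation code (names here are illustrative):

```java
import java.util.function.Supplier;

// One wrapper on the request path captures latency, traffic, and errors
// for every endpoint at once.
public class MetricsMiddleware {
    long requestCount = 0;
    long errorCount = 0;
    long totalMillis = 0;

    /** Wraps a handler that returns an HTTP status code. */
    public int handle(Supplier<Integer> handler) {
        long start = System.nanoTime();
        int status = handler.get();
        totalMillis += (System.nanoTime() - start) / 1_000_000;
        requestCount++;
        if (status >= 500) errorCount++;
        return status;
    }
}
```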

Beyond the basic signals, instrument your business operations explicitly:

// Technical metrics are captured by infrastructure
// Business metrics require explicit instrumentation
meterRegistry.counter("orders.created", 
    "status", order.getStatus().name(),
    "payment_method", order.getPaymentMethod()).increment();

meterRegistry.timer("order.processing.duration").record(duration);

An order service should be able to tell you how many orders were created in the last hour, what the success rate was, and what the p99 processing duration was — not just whether the HTTP endpoint was healthy.

The Runbook as Part of Monitoring

Every alert should link to a runbook. Not a runbook that says "investigate the logs" — one that says:

  1. This alert fires when error rate exceeds 1% for 5 minutes
  2. Common causes: database connection pool exhaustion, payment provider timeout, recent deploy regression
  3. Investigation steps: check connection pool metrics, check payment provider status page, check recent deploy logs
  4. Remediation: connection pool — increase pool size or restart; provider timeout — circuit breaker should activate; deploy regression — roll back

An alert without a runbook means the engineer who gets paged at 2am starts from scratch every time.

The Practical Takeaway

For the most critical service your team operates, check: what would you know, and how quickly, if error rate hit 5%? If the answer is "we'd know when users told us," add an error rate alert today. Pick a threshold (1% for critical services, 5% for less critical), set the alert, and write the one-paragraph runbook that tells whoever gets paged what to check first.

