Monitoring Is Not Optional. It Is How You Know Your App Is Alive.
by Eric Hanson, Backend Developer at Clean Systems Consulting
The Incident You Should Have Caught Yesterday
Your payment service has been returning errors for six hours. Not a total outage — about 3% of requests are failing silently. Affected users are seeing generic error messages. Nobody on the team knows yet because there are no alerts configured on error rate. The first signal is a surge in customer support tickets that morning.
Six hours of 3% payment failures is a significant business event. It could have been detected and resolved in under an hour with a simple error rate alert. Without monitoring, it ran silently until users complained.
This happens on services with no monitoring, and it happens regularly.
The Minimum Viable Monitoring Setup
There's a well-established framework for service monitoring called the Four Golden Signals, from Google's SRE Book. For most services, these four metrics are necessary and sufficient for a minimum monitoring posture:
Latency: How long requests take to process, at p50, p95, and p99. Alert when p99 latency exceeds a threshold that indicates user-visible degradation.
Traffic: Request rate, in requests per second or per minute. Both too high (overload) and too low (potential service failure, traffic routing issue) should alert.
Errors: The rate of failed requests as a percentage of total requests. Even a 1% error rate on a critical service is worth alerting on. A 5% error rate is almost always an incident.
Saturation: How "full" the service's resources are — CPU, memory, thread pool utilization, connection pool utilization. Saturation trending toward limits is a leading indicator before things break.
These four metrics, with sensible alert thresholds, will catch the majority of production incidents before users report them.
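As a rough illustration, here is a minimal sketch of registering these signals with Micrometer. The metric names, the GoldenSignals class, and the use of a ThreadPoolExecutor for saturation are illustrative assumptions, not a prescribed scheme; most frameworks and APM agents will register equivalents for you.

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import java.time.Duration;
import java.util.concurrent.ThreadPoolExecutor;

public class GoldenSignals {

    // Latency, traffic, and errors can all come from one request timer:
    // the percentiles give latency, the count gives traffic, and a status
    // tag lets you compute error rate from the same series.
    static void recordRequest(MeterRegistry registry, String status, Duration elapsed) {
        Timer.builder("http.server.requests")
                .tag("status", status)                 // e.g. "200", "500"
                .publishPercentiles(0.5, 0.95, 0.99)   // p50 / p95 / p99 latency
                .register(registry)
                .record(elapsed);
    }

    // Saturation: expose how full a bounded resource is, here a thread pool.
    static void registerSaturation(MeterRegistry registry, ThreadPoolExecutor pool) {
        Gauge.builder("executor.pool.utilization", pool,
                        p -> (double) p.getActiveCount() / p.getMaximumPoolSize())
                .register(registry);
    }
}

From these two meters you can derive all four alerts: p99 latency and request rate from the timer, error rate from its status tag, and saturation from the gauge.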
The Alert Quality Problem
Many teams add monitoring without adding useful monitoring. The failure modes:
Alerting on symptoms rather than causes: "CPU is above 80%" is a symptom. It may or may not correspond to user-visible impact. "Error rate is above 1%" is user-visible impact, and it's almost always worth alerting on regardless of the underlying cause.
Alert fatigue from noisy alerts: Alerts that fire for brief spikes that don't represent real problems train engineers to ignore alerts. If your alert fires five times a week and three of them are false positives, the real incident is the one you ignore because you've been conditioned to assume it's also a false positive.
Missing alerts on gradual degradation: Acute failures (sudden error spike, service down) are usually obvious. Gradual degradation (p99 latency trending up by 20% per week, successful request rate slowly declining) is invisible without trend-based alerting.
The Alerting Principle: Alert on User Impact
The right test for any alert: does this alert firing represent a degradation in the experience of users of this service right now? If yes, it should alert. If no, it should be a metric available for investigation, not an alert.
"Database CPU is at 70%" — should not alert, because 70% CPU doesn't by itself represent user impact. "p99 latency exceeds 2,000ms" — should alert, because users are waiting 2 seconds for operations that should be sub-200ms. "Error rate exceeds 1%" — should alert, because 1 in 100 users is seeing a failure. "Thread pool utilization exceeds 90%" — should alert as a warning, because this is a leading indicator that will produce user impact soon.
What to Instrument
For HTTP services, most of the Four Golden Signals come from the request/response cycle. Frameworks and APM tools (Datadog APM, New Relic, Prometheus with Micrometer) typically make this easy to instrument at the middleware layer without per-endpoint code.
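Under the hood, that middleware-level instrumentation amounts to something like the following sketch, assuming a Jakarta Servlet stack and Micrometer; the filter class and metric name are illustrative, and Spring Boot or an APM agent will do this recording for you automatically.

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;

import java.io.IOException;

// One filter times every request and tags it with method and status, so
// latency, traffic, and error rate come from a single metric with no
// per-endpoint code.
public class RequestMetricsFilter implements Filter {

    private final MeterRegistry registry;

    public RequestMetricsFilter(MeterRegistry registry) {
        this.registry = registry;
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        Timer.Sample sample = Timer.start(registry);
        try {
            chain.doFilter(req, res);
        } finally {
            sample.stop(Timer.builder("http.server.requests")
                    .tag("method", ((HttpServletRequest) req).getMethod())
                    .tag("status", String.valueOf(((HttpServletResponse) res).getStatus()))
                    .register(registry));
        }
    }
}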
Beyond the basic signals, instrument your business operations explicitly:
// Technical metrics are captured by infrastructure;
// business metrics require explicit instrumentation.
meterRegistry.counter("orders.created",
        "status", order.getStatus().name(),
        "payment_method", order.getPaymentMethod())
    .increment();

meterRegistry.timer("order.processing.duration").record(duration);
An order service should be able to tell you how many orders were created in the last hour, what the success rate was, and what the p99 processing duration was — not just whether the HTTP endpoint was healthy.
The Runbook as Part of Monitoring
Every alert should link to a runbook. Not a runbook that says "investigate the logs" — one that says:
- This alert fires when error rate exceeds 1% for 5 minutes
- Common causes: database connection pool exhaustion, payment provider timeout, recent deploy regression
- Investigation steps: check connection pool metrics, check payment provider status page, check recent deploy logs
- Remediation: connection pool — increase pool size or restart; provider timeout — circuit breaker should activate; deploy regression — roll back
An alert without a runbook means the engineer who gets paged at 2am starts from scratch every time.
The Practical Takeaway
For the most critical service your team operates, check: what would you know, and how quickly, if error rate hit 5%? If the answer is "we'd know when users told us," add an error rate alert today. Pick a threshold (1% for critical services, 5% for less critical), set the alert, and write the one-paragraph runbook that tells whoever gets paged what to check first.