Metrics and Alerts in Microservices: What You Should Actually Be Watching
by Eric Hanson, Backend Developer at Clean Systems Consulting
The monitoring gap between infrastructure and user experience
Your infrastructure metrics look fine. CPU at 30%, memory at 60%, pods healthy. But your error rate has been at 8% for the last four minutes and no alert has fired. Users are seeing failures. Your on-call engineer found out from a Slack message, not a PagerDuty notification.
The problem is that infrastructure metrics (CPU, memory, disk) don't directly reflect user experience. A service can consume 80% CPU and serve traffic perfectly. A service can consume 20% CPU and return 500 errors for 15% of requests. Alerting on infrastructure thresholds while ignoring user-facing signals means your alerting is optimized for finding noisy servers, not broken services.
The four golden signals
Google's SRE Book defines four golden signals as the primary metrics for service health. These are the metrics that should drive your alerts:
Latency: how long requests take to process. Track P50 (median), P95, and P99. P99 is what the slowest 1% of requests experience, and because a single user action often fans out into many requests, far more than 1% of users feel it. If P99 exceeds your SLA threshold, users are having a bad time even if P50 looks fine.
Traffic: requests per second (or another throughput measure appropriate for your service). Traffic establishes the baseline that makes other metrics meaningful. An error rate of 100 errors/minute is trivial at 100,000 requests/minute (0.1%) and catastrophic at 200 requests/minute (50%).
Errors: the rate of failed requests. Distinguish between user errors (4xx, particularly 400 and 422), which may be legitimate, and server errors (5xx), which always indicate a problem. Alert on the server error rate.
Saturation: how close to capacity the service is. Connection pool utilization, queue depth, thread pool utilization. High saturation precedes failures — a connection pool at 95% utilization is about to cause request queuing and latency spikes.
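Translated into Prometheus alerting rules, the error and latency signals for an Order Service look like this (the 1% and 2s thresholds are illustrative; derive yours from your SLOs):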
# Prometheus alerts: golden signals for Order Service
groups:
  - name: order-service
    rules:
      # sum() aggregates across instances and status codes so the
      # numerator's and denominator's label sets match for division
      - alert: OrderServiceHighErrorRate
        expr: |
          sum(rate(http_server_requests_seconds_count{
            service="order-service", status=~"5.."
          }[5m]))
          /
          sum(rate(http_server_requests_seconds_count{
            service="order-service"
          }[5m])) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Order Service error rate above 1% for 2 minutes"

      # sum by (le) keeps the histogram buckets that
      # histogram_quantile needs while aggregating across instances
      - alert: OrderServiceHighLatency
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_server_requests_seconds_bucket{
              service="order-service"
            }[5m]))
          ) > 2.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Order Service P99 latency above 2s"
RED versus USE: choosing the right model per layer
RED (Rate, Errors, Duration): the right model for request-handling services — anything with an API. Rate = requests/sec, Errors = error rate, Duration = latency. Apply this to every service endpoint.
USE (Utilization, Saturation, Errors): the right model for resources — databases, connection pools, queues, thread pools. Utilization = how busy, Saturation = how much work is queuing, Errors = error rate in the resource. Apply this to your infrastructure components.
For a database connection pool, USE gives you: Utilization (active connections / max connections), Saturation (requests waiting for a connection), Errors (connection acquisition failures). A connection pool at 90% utilization with a growing saturation queue is about to become a bottleneck. Alerting on this proactively prevents the cascade where connection pool exhaustion causes upstream service latency, which exhausts upstream thread pools.
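To make that concrete, here is a sketch of those three USE signals as Prometheus rules for a database pool. The hikaricp_connections_* metric names assume Micrometer's HikariCP binding, and the order-db pool label is illustrative; substitute whatever your pool actually exports:

# USE alerts for a database connection pool. Metric names assume
# Micrometer's HikariCP binding; the "order-db" pool name is illustrative.

# Utilization: active connections as a fraction of the pool maximum
- alert: OrderDbPoolHighUtilization
  expr: |
    hikaricp_connections_active{pool="order-db"}
    /
    hikaricp_connections_max{pool="order-db"} > 0.9
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "order-db connection pool above 90% utilization"

# Saturation: threads waiting for a connection means work is queuing
- alert: OrderDbPoolSaturated
  expr: hikaricp_connections_pending{pool="order-db"} > 0
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Requests are queuing for order-db connections"

# Errors: connection acquisition timeouts
- alert: OrderDbPoolTimeouts
  expr: rate(hikaricp_connections_timeout_total{pool="order-db"}[5m]) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "order-db connection acquisition timeouts occurring"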
Consumer lag for event-driven services
Services that consume from Kafka have an additional critical metric: consumer lag — the number of messages in a partition that have not yet been processed. Increasing consumer lag means your consumer is falling behind producers.
# Alert on total Kafka consumer lag, summed across partitions.
# (Metric and label names vary by exporter; adjust to what yours exposes.)
- alert: KafkaConsumerLagHigh
  expr: |
    sum(kafka_consumer_group_lag{
      group="inventory-service",
      topic="orders.confirmed"
    }) > 1000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Inventory Service is 1000+ messages behind on orders.confirmed"
Growing consumer lag indicates one of three things: the consumer is too slow (a processing bottleneck), the producer is generating more volume than expected (a load spike), or the consumer is down entirely (lag grows without bound). Each has a different remediation.
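A fixed threshold only fires once the backlog is already large. A complementary option is to alert when lag is trending upward, which catches a slowing consumer earlier. A minimal sketch using PromQL's deriv() over a subquery; the windows and the 10 messages/second slope are placeholders to tune against your traffic:

# A sketch: alert on the trend of total lag rather than its level.
# deriv() fits a per-second slope over the window; the [15m:1m]
# subquery lets it run over the summed series.
- alert: KafkaConsumerLagGrowing
  expr: |
    deriv(
      sum(kafka_consumer_group_lag{
        group="inventory-service",
        topic="orders.confirmed"
      })[15m:1m]
    ) > 10
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Inventory Service lag on orders.confirmed has been growing for 10+ minutes"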
Alerting philosophy: alert on symptoms, not causes
Alert on user-visible symptoms. Root cause analysis happens after the alert fires — not before.
Wrong: alert when CPU > 80%. CPU at 80% might not affect users at all. This produces false positives.
Right: alert when P99 latency exceeds your SLA threshold or the error rate exceeds X% (exactly the two Order Service rules above). These directly impact users, which is what matters.
Causes (high CPU, connection pool saturation, slow queries) belong on dashboards for diagnosis after the symptom alert fires. Alerting on causes without symptoms produces noisy, low-signal alerts that engineers learn to ignore, and ignored alerts are worse than no alerts at all.
Keep your alert count low and actionable. An on-call rotation with twenty noisy alerts that fire regularly produces alert fatigue. Alert fatigue produces missed incidents. Three to five high-signal alerts per service that fire infrequently and always require action are worth more than twenty alerts that fire several times a week and are usually safe to ignore.