Observability Is Not Just Logging. Here Is What You Are Missing.
by Arif Ikhsanudin, Backend Developer
Why logs alone aren't enough
Your service emits excellent logs. Structured JSON, correlation IDs, error stacks. When something fails, you can tell exactly what happened inside that service. What you can't answer from logs alone: why is the 95th percentile latency for checkout up 40% today compared to last Tuesday? Which downstream service is responsible? Is this affecting all users or a specific segment? Is this correlated with a deployment that happened two hours ago?
Logs are an event-centric view of a system: things that happened, at specific moments, with context. They're essential but insufficient for understanding system behavior over time, across services, and at scale. Observability — the ability to understand your system's internal state from its external outputs — requires three types of signal that serve different analytical purposes: metrics, logs, and traces.
Metrics: the aggregate view
Metrics are numerical measurements aggregated over time. They answer "how is the system performing?" rather than "what happened in this specific request?" A metric like http_server_requests_seconds summarizes thousands of requests. With histogram buckets enabled, it gives you the statistical distribution of response times.
Metrics are cheap to store and query at scale precisely because they're aggregated. You don't store every individual request duration; you store bucket counts in a histogram. Prometheus with Grafana is the standard open-source stack. In Java, Spring Boot Actuator with Micrometer's Prometheus registry on the classpath exposes application metrics in Prometheus format without custom code.
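To make the bucket idea concrete, here is a minimal sketch in plain Java. It is not how Micrometer or Prometheus implement histograms; the class name LatencyHistogram, the bucket bounds, and the sample latencies are all assumptions chosen for illustration. It records only which bucket each observation fell into and estimates a percentile by interpolating inside the matching bucket.

```java
import java.util.Arrays;

/** Illustrative fixed-bucket histogram; not Micrometer's or Prometheus's implementation. */
final class LatencyHistogram {
    private final double[] upperBounds; // finite bucket bounds in seconds, ascending
    private final long[] counts;        // per-bucket counts; the last slot is the +Inf bucket
    private long total;

    LatencyHistogram(double... upperBounds) {
        if (upperBounds.length == 0) throw new IllegalArgumentException("need at least one bucket");
        this.upperBounds = upperBounds.clone();
        Arrays.sort(this.upperBounds);
        this.counts = new long[upperBounds.length + 1];
    }

    /** Stores only which bucket the observation fell into, never the raw value. */
    void record(double seconds) {
        int i = 0;
        while (i < upperBounds.length && seconds > upperBounds[i]) i++;
        counts[i]++;
        total++;
    }

    /** Estimates quantile q (0..1) by linear interpolation inside the matching bucket. */
    double quantile(double q) {
        if (total == 0) return Double.NaN;
        double rank = q * total;
        long cumulative = 0;
        for (int i = 0; i < upperBounds.length; i++) {
            long before = cumulative;
            cumulative += counts[i];
            if (counts[i] > 0 && cumulative >= rank) {
                double lower = (i == 0) ? 0.0 : upperBounds[i - 1];
                return lower + (upperBounds[i] - lower) * (rank - before) / counts[i];
            }
        }
        return upperBounds[upperBounds.length - 1]; // quantile lies in the +Inf bucket
    }

    public static void main(String[] args) {
        // Hypothetical workload: 100 requests with made-up latencies
        LatencyHistogram h = new LatencyHistogram(0.1, 0.5, 1.0);
        for (int i = 0; i < 80; i++) h.record(0.05);
        for (int i = 0; i < 15; i++) h.record(0.30);
        for (int i = 0; i < 5; i++) h.record(0.80);
        System.out.printf("p50 ~ %.4f s, p95 ~ %.4f s%n", h.quantile(0.50), h.quantile(0.95));
    }
}
```

The storage cost is three bucket counters plus the overflow bucket, no matter how many requests arrive. The price is precision: the estimate can only be as fine-grained as the bucket boundaries, so bucket bounds are worth choosing around the latencies you care about.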
What metrics tell you that logs can't:
- Trend analysis: is latency slowly degrading over days?
- Capacity planning: at current growth rate, when will you hit connection pool limits?
- Anomaly detection: is this metric outside its normal range for this time of day?
- Correlated failures: did error rates in Service A increase when Service B was deployed?
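// Custom business metric with Micrometer
import io.micrometer.core.instrument.Counter;

// paymentMethod, region, and meterRegistry come from the surrounding service code.
// register() returns the existing counter if this name/tag combination is already registered.
Counter orderCounter = Counter.builder("orders.created")
        .tag("payment_method", paymentMethod)
        .tag("region", region)
        .register(meterRegistry);
orderCounter.increment();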
This counter, scraped by Prometheus, lets you build dashboards showing order volume by payment method and region — a business signal that pure infrastructure metrics never surface.
Traces: the request-centric view
Traces follow a single request across every service it touches. They answer "where did this specific request spend its time, and where did it fail?" — something neither metrics (aggregated) nor logs (per-service) can answer alone.
The connection between metrics and traces is what makes the combination powerful. A metric alert fires: P99 latency for checkout is above SLA. You open a trace for a recent slow checkout. The trace shows that 80% of the latency is in the Inventory Service's database query. You open the Inventory Service's query performance logs. You find a specific slow query plan. You've moved from system-level alert to specific root cause in three steps, without guessing.
Without traces, the path from "checkout is slow" to "this specific database query is the cause" involves multiple disconnected tools, multiple teams, and significant time. With traces connected to metrics and logs, it's a directed investigation.
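The "where did this request spend its time" question can be sketched with a toy data model. The following is plain Java, not the OpenTelemetry API; the Span class, the service names, and the timings are hypothetical and chosen so the numbers mirror the checkout example above. A trace is modeled as a tree of timed spans, and the analysis looks for the span with the largest self time, meaning time not accounted for by its children.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a trace as a tree of timed spans; this is not the OpenTelemetry API. */
final class TraceAnalysis {

    static final class Span {
        final String service;
        final String operation;
        final long startMs;
        final long endMs;
        final List<Span> children = new ArrayList<>();

        Span(String service, String operation, long startMs, long endMs) {
            this.service = service;
            this.operation = operation;
            this.startMs = startMs;
            this.endMs = endMs;
        }

        Span child(Span c) { children.add(c); return this; }

        long durationMs() { return endMs - startMs; }

        /** Time spent in this span itself; assumes direct children do not overlap. */
        long selfTimeMs() {
            long childTime = 0;
            for (Span c : children) childTime += c.durationMs();
            return durationMs() - childTime;
        }
    }

    /** Returns the span with the largest self time anywhere in the tree. */
    static Span slowest(Span root) {
        Span best = root;
        for (Span c : root.children) {
            Span candidate = slowest(c);
            if (candidate.selfTimeMs() > best.selfTimeMs()) best = candidate;
        }
        return best;
    }

    /** Hypothetical checkout trace with made-up services and timings. */
    static Span sampleTrace() {
        Span query = new Span("inventory-db", "SELECT stock", 110, 910);
        Span inventory = new Span("inventory-service", "reserve", 100, 930).child(query);
        Span payment = new Span("payment-service", "authorize", 930, 990);
        return new Span("checkout-service", "POST /checkout", 0, 1000)
                .child(inventory).child(payment);
    }

    public static void main(String[] args) {
        Span root = sampleTrace();
        Span s = slowest(root);
        System.out.printf("%s / %s: %d of %d ms%n",
                s.service, s.operation, s.selfTimeMs(), root.durationMs());
    }
}
```

In this made-up trace the database query accounts for 800 of the 1000 ms, which is the kind of answer a trace viewer gives you visually. Real traces also contain parallel and overlapping spans, which this sketch deliberately ignores.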
The connection layer: correlation IDs and exemplars
The three signals are most powerful when they reference each other. This requires two things:
Correlation IDs in logs and traces: every log line should include the trace ID. In OpenTelemetry, the trace context is available via the current span, and can be injected into your structured logging context:
// With the OpenTelemetry Java agent, supported logging frameworks (Logback, Log4j)
// get trace_id and span_id added to the MDC automatically.
// Result: a JSON log layout that includes MDC fields emits both IDs on every line
{
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "message": "Inventory reservation attempted",
  "item_id": "sku-774",
  "quantity": 2
}
Now you can go from a metric anomaly → find a representative trace → follow the trace ID to specific log lines from every service in that trace.
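If you are not running an agent, the same correlation can be done by hand. Between services the trace context usually travels in the W3C traceparent HTTP header, laid out as version-traceid-parentid-flags in lowercase hex. The sketch below, in plain Java, pulls the IDs out of such a header and puts them on a log line. The class and method names are hypothetical, and the JSON rendering is deliberately naive (no escaping). Note that the header carries the caller's span ID, so it is logged here as parent_span_id; a real tracer would start its own span and log that span's ID.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: turn an incoming W3C traceparent header into fields for structured logs. */
final class TraceLogContext {
    // Layout: version-traceid-parentid-flags, all lowercase hex
    private static final Pattern TRACEPARENT =
            Pattern.compile("([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    /** Returns trace_id and parent_span_id, or an empty map if the header is missing or malformed. */
    static Map<String, String> fromHeader(String traceparent) {
        Map<String, String> ctx = new LinkedHashMap<>();
        if (traceparent == null) return ctx;
        Matcher m = TRACEPARENT.matcher(traceparent.trim());
        // An all-zero trace ID is not a valid trace
        if (!m.matches() || m.group(2).equals("0".repeat(32))) return ctx;
        ctx.put("trace_id", m.group(2));
        ctx.put("parent_span_id", m.group(3));
        return ctx;
    }

    /** Renders a one-line JSON log entry; assumes values contain no characters that need escaping. */
    static String logLine(Map<String, String> ctx, String message) {
        StringBuilder sb = new StringBuilder("{");
        for (Map.Entry<String, String> e : ctx.entrySet()) {
            sb.append('"').append(e.getKey()).append("\":\"").append(e.getValue()).append("\",");
        }
        return sb.append("\"message\":\"").append(message).append("\"}").toString();
    }

    public static void main(String[] args) {
        Map<String, String> ctx =
                fromHeader("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println(logLine(ctx, "Inventory reservation attempted"));
    }
}
```

Once every service logs the same trace_id, a single query on that field in your log store returns the lines from all services that handled the request.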
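Exemplars in metrics: Prometheus supports exemplars, which attach a trace ID to a specific metric observation. Exemplar storage has to be enabled on the Prometheus server, and the application has to expose exemplars alongside its metrics. When investigating a P99 latency anomaly, you can then jump directly from the metric to a real trace that contributed to that metric bucket: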
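// Micrometer exemplar support with OpenTelemetry
import io.micrometer.core.instrument.Timer;

var result = Timer.builder("checkout.duration")
        .publishPercentileHistogram() // publish buckets so exemplars have a bucket to attach to
        .register(meterRegistry)
        .record(() -> {
            // If the Prometheus registry is wired to the tracer, the current
            // trace ID is attached as an exemplar; otherwise none is recorded
            return checkoutService.process(request);
        });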
In Grafana, clicking on a metric data point with exemplars shows you a direct link to a trace from that time window. This is the connection that makes "the metric spiked at 14:47" lead directly to "here's a trace from that spike."
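Conceptually, an exemplar is just a pointer stored next to a bucket counter. The sketch below shows that idea in plain Java; it is not Micrometer's or Prometheus's implementation, and the class name ExemplarHistogram, the metric name, and the bucket bounds are assumptions. Each bucket remembers the most recent trace ID that landed in it, and render() prints lines that resemble the OpenMetrics text format, where the exemplar follows a # after the sample value.

```java
import java.util.Locale;

/** Sketch of a histogram that keeps one exemplar (trace ID + value) per bucket. */
final class ExemplarHistogram {
    private final double[] upperBounds;      // finite bucket bounds in seconds, must be ascending
    private final long[] counts;             // per-bucket counts; the last slot is +Inf
    private final String[] exemplarTraceIds; // most recent trace ID seen in each bucket
    private final double[] exemplarValues;   // the observation that trace ID belongs to

    ExemplarHistogram(double... ascendingUpperBounds) {
        this.upperBounds = ascendingUpperBounds.clone();
        this.counts = new long[upperBounds.length + 1];
        this.exemplarTraceIds = new String[upperBounds.length + 1];
        this.exemplarValues = new double[upperBounds.length + 1];
    }

    private int bucketIndex(double seconds) {
        int i = 0;
        while (i < upperBounds.length && seconds > upperBounds[i]) i++;
        return i;
    }

    /** Counts the observation and, if a trace is active, remembers it as the bucket's exemplar. */
    void record(double seconds, String traceId) {
        int i = bucketIndex(seconds);
        counts[i]++;
        if (traceId != null) {
            exemplarTraceIds[i] = traceId;
            exemplarValues[i] = seconds;
        }
    }

    /** Trace ID to open when investigating latencies around the given value, or null if none. */
    String exemplarTraceId(double seconds) {
        return exemplarTraceIds[bucketIndex(seconds)];
    }

    /** Renders cumulative bucket lines, with the exemplar appended after '#'. */
    String render(String name) {
        StringBuilder sb = new StringBuilder();
        long cumulative = 0;
        for (int i = 0; i < counts.length; i++) {
            cumulative += counts[i];
            String le = i < upperBounds.length ? String.valueOf(upperBounds[i]) : "+Inf";
            sb.append(String.format(Locale.ROOT, "%s_bucket{le=\"%s\"} %d", name, le, cumulative));
            if (exemplarTraceIds[i] != null) {
                sb.append(String.format(Locale.ROOT, " # {trace_id=\"%s\"} %s",
                        exemplarTraceIds[i], exemplarValues[i]));
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ExemplarHistogram h = new ExemplarHistogram(0.5, 1.0);
        h.record(0.12, null); // request without an active trace
        h.record(2.3, "4bf92f3577b34da6a3ce929d0e0e4736");
        System.out.print(h.render("checkout_duration_seconds"));
    }
}
```

The slow bucket now carries a trace ID, which is all a dashboard needs in order to turn a data point into a link to the trace backend.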
The OpenTelemetry stack in practice
OpenTelemetry (OTel) is the CNCF standard for instrumentation, replacing the fragmented landscape of Jaeger clients, Zipkin clients, and proprietary SDKs. One instrumentation library, pluggable backends.
The reference architecture:
- OpenTelemetry agent (Java, .NET, Python, Node.js) auto-instruments your services for traces and metrics with zero code changes
- OpenTelemetry Collector receives traces, metrics, and logs; applies processing (sampling, attribute filtering); exports to backends
- Prometheus for metrics storage and alerting
- Grafana Tempo or Jaeger for trace storage
- Grafana Loki or Elasticsearch for log storage
- Grafana unified dashboard connecting all three
This stack is fully open-source, runs on Kubernetes, and costs significantly less than managed alternatives at moderate scale. At high scale (> 10 TB/day of telemetry), managed offerings (Datadog, Honeycomb, Grafana Cloud) become cost-competitive once you count the operational effort of running the stack yourself.
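Of the processing steps the Collector applies, sampling has the most direct effect on how much telemetry you store. One common approach is to derive the keep-or-drop decision from the trace ID itself, so every component reaches the same decision for a given trace without coordinating. The sketch below illustrates that idea in plain Java. It is not the Collector's implementation: the class name TraceIdSampler and the exact bit arithmetic are assumptions, and it relies on trace IDs being random 32-character hex strings.

```java
/** Sketch of deterministic, trace-ID-based sampling; not the Collector's actual implementation. */
final class TraceIdSampler {
    private final double ratio;

    TraceIdSampler(double ratio) {
        if (ratio < 0.0 || ratio > 1.0) throw new IllegalArgumentException("ratio must be in [0, 1]");
        this.ratio = ratio;
    }

    /** The same trace ID and ratio always produce the same decision, on any host. */
    boolean shouldSample(String traceId) {
        if (ratio >= 1.0) return true;
        if (ratio <= 0.0) return false;
        // Treat the low 64 bits of the 128-bit hex trace ID as a pseudo-random number
        long low = Long.parseUnsignedLong(traceId.substring(16), 16);
        long value = low >>> 1; // drop one bit so the value is never negative
        long threshold = (long) (ratio * Long.MAX_VALUE);
        return value < threshold;
    }

    public static void main(String[] args) {
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736";
        System.out.println("keep at 50%: " + new TraceIdSampler(0.5).shouldSample(traceId));
        System.out.println("keep at 70%: " + new TraceIdSampler(0.7).shouldSample(traceId));
    }
}
```

Because the decision depends only on the trace ID, a trace is either kept in full or dropped in full. That matters more than the exact ratio, since a trace with missing spans is hard to read.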
Observability is not a tool. It's the capability to ask arbitrary questions about your system's behavior and get answers quickly. Logs, metrics, and traces — individually — give you partial answers. Connected and queryable together, they give you genuine understanding of how your distributed system behaves under real conditions.