Observability Is Not Just Logging. Here Is What You Are Missing.

by Arif Ikhsanudin, Backend Developer

Why logs alone aren't enough

Your service emits excellent logs. Structured JSON, correlation IDs, error stacks. When something fails, you can tell exactly what happened inside that service. What you can't answer from logs alone: why is the 95th percentile latency for checkout up 40% today compared to last Tuesday? Which downstream service is responsible? Is this affecting all users or a specific segment? Is this correlated with a deployment that happened two hours ago?

Logs are an event-centric view of a system: things that happened, at specific moments, with context. They're essential but insufficient for understanding system behavior over time, across services, and at scale. Observability — the ability to understand your system's internal state from its external outputs — requires three types of signal that serve different analytical purposes: metrics, logs, and traces.

Metrics: the aggregate view

Metrics are numerical measurements aggregated over time. They answer "how is the system performing?" rather than "what happened in this specific request?" A metric like http_server_requests_seconds summarizes thousands of requests as a count, a sum, and (when histogram buckets are enabled) the statistical distribution of response times.

Metrics are cheap to store and query at scale precisely because they're aggregated. You don't store every individual request duration; you store bucket counts in a histogram. Prometheus with Grafana is the standard open-source stack, and in Java the Micrometer library exposes Spring Boot metrics in Prometheus format once the Prometheus registry dependency is added and the Actuator endpoint is exposed.
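To make the "bucket counts, not individual durations" idea concrete, here is a minimal sketch in plain Java. It is not Micrometer or Prometheus code: the class name LatencyHistogram, the bucket bounds, and the sample values are all assumptions chosen for illustration, and the quantile estimate is deliberately coarse (it reports the upper bound of the bucket, with no interpolation).

```java
import java.util.Arrays;

// Illustrative fixed-bucket latency histogram (hypothetical, not a real library class).
final class LatencyHistogram {
    private final double[] upperBounds; // inclusive upper bound of each bucket, in seconds
    private final long[] counts;        // counts[i] = observations in bucket i; last slot is overflow (+Inf)
    private long total;

    LatencyHistogram(double... upperBounds) {
        this.upperBounds = upperBounds.clone();
        Arrays.sort(this.upperBounds);
        this.counts = new long[upperBounds.length + 1];
    }

    // Only a counter is incremented; the individual duration is not stored.
    void record(double seconds) {
        int i = 0;
        while (i < upperBounds.length && seconds > upperBounds[i]) i++;
        counts[i]++;
        total++;
    }

    long count() { return total; }

    // Upper bound of the bucket that contains the q-quantile (0 < q <= 1).
    double quantileUpperBound(double q) {
        if (total == 0) return Double.NaN;
        long rank = (long) Math.ceil(q * total);
        long seen = 0;
        for (int i = 0; i < counts.length; i++) {
            seen += counts[i];
            if (seen >= rank) {
                return i < upperBounds.length ? upperBounds[i] : Double.POSITIVE_INFINITY;
            }
        }
        return Double.POSITIVE_INFINITY;
    }

    public static void main(String[] args) {
        LatencyHistogram h = new LatencyHistogram(0.1, 0.25, 0.5, 1.0);
        for (int i = 0; i < 90; i++) h.record(0.05); // fast requests
        for (int i = 0; i < 8; i++) h.record(0.3);   // slower requests
        for (int i = 0; i < 2; i++) h.record(2.0);   // outliers beyond the last bound
        System.out.println("count = " + h.count());
        System.out.println("p50 <= " + h.quantileUpperBound(0.50));
        System.out.println("p95 <= " + h.quantileUpperBound(0.95));
        System.out.println("p99 <= " + h.quantileUpperBound(0.99));
    }
}
```

However many requests are recorded, the storage cost stays at one counter per bucket. The trade-off is resolution: the histogram can only say which bucket a percentile falls into.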

What metrics tell you that logs can't:

  • Trend analysis: is latency slowly degrading over days?
  • Capacity planning: at current growth rate, when will you hit connection pool limits?
  • Anomaly detection: is this metric outside its normal range for this time of day?
  • Correlated failures: did error rates in Service A increase when Service B was deployed?
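
// Custom business metric with Micrometer.
// Assumes meterRegistry (a MeterRegistry), paymentMethod, and region are in scope.
import io.micrometer.core.instrument.Counter;

// register() returns the existing counter if one with this name and tags is already registered
Counter orderCounter = Counter.builder("orders.created")
    .tag("payment_method", paymentMethod)
    .tag("region", region)
    .register(meterRegistry);

orderCounter.increment();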

This counter, scraped by Prometheus, lets you build dashboards showing order volume by payment method and region — a business signal that pure infrastructure metrics never surface.

Traces: the request-centric view

Traces follow a single request across every service it touches. They answer "where did this specific request spend its time, and where did it fail?" — something neither metrics (aggregated) nor logs (per-service) can answer alone.

The connection between metrics and traces is what makes the combination powerful. A metric alert fires: P99 latency for checkout is above SLA. You open a trace for a recent slow checkout. The trace shows that 80% of the latency is in the Inventory Service's database query. You open the Inventory Service's query performance logs. You find a specific slow query plan. You've moved from system-level alert to specific root cause in three steps, without guessing.

Without traces, the path from "checkout is slow" to "this specific database query is the cause" involves multiple disconnected tools, multiple teams, and significant time. With traces connected to metrics and logs, it's a directed investigation.
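The step from "checkout is slow" to "this operation is where the time went" can be sketched in plain Java. This is only an illustration of the idea, not how Tempo or Jaeger compute anything: the Span class, the service and operation names, and the timings are hypothetical, and the self-time calculation assumes a span's children run one after another rather than in parallel.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative trace analysis over a hypothetical, simplified span model.
final class TraceBreakdown {
    static final class Span {
        final String spanId, parentId, service, operation;
        final long startMs, endMs;

        Span(String spanId, String parentId, String service, String operation, long startMs, long endMs) {
            this.spanId = spanId;
            this.parentId = parentId;
            this.service = service;
            this.operation = operation;
            this.startMs = startMs;
            this.endMs = endMs;
        }

        long durationMs() { return endMs - startMs; }
    }

    // Self time = span duration minus the duration of its direct children
    // (assumes children are sequential and do not overlap).
    static Map<String, Long> selfTimeByOperation(List<Span> spans) {
        Map<String, Long> childTime = new HashMap<>();
        for (Span s : spans) {
            if (s.parentId != null) childTime.merge(s.parentId, s.durationMs(), Long::sum);
        }
        Map<String, Long> result = new LinkedHashMap<>();
        for (Span s : spans) {
            long self = s.durationMs() - childTime.getOrDefault(s.spanId, 0L);
            result.merge(s.service + ": " + s.operation, self, Long::sum);
        }
        return result;
    }

    // Key of the operation with the largest self time, or null for an empty map.
    static String slowest(Map<String, Long> selfTimes) {
        String best = null;
        long bestMs = 0;
        for (Map.Entry<String, Long> e : selfTimes.entrySet()) {
            if (best == null || e.getValue() > bestMs) {
                best = e.getKey();
                bestMs = e.getValue();
            }
        }
        return best;
    }

    // Made-up trace for one slow checkout request (1000 ms end to end).
    static List<Span> sampleCheckoutTrace() {
        return List.of(
            new Span("a", null, "checkout-service", "POST /checkout", 0, 1000),
            new Span("b", "a", "payment-service", "authorize", 20, 150),
            new Span("c", "a", "inventory-service", "reserve", 160, 980),
            new Span("d", "c", "inventory-service", "SELECT stock", 170, 970));
    }

    public static void main(String[] args) {
        Map<String, Long> selfTimes = selfTimeByOperation(sampleCheckoutTrace());
        selfTimes.forEach((op, ms) -> System.out.println(op + " -> " + ms + " ms"));
        System.out.println("largest contributor: " + slowest(selfTimes));
    }
}
```

In this made-up trace the database query span accounts for 800 of the 1000 milliseconds, which is the kind of answer a trace viewer gives you visually in its waterfall view.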

The connection layer: correlation IDs and exemplars

The three signals are most powerful when they reference each other. This requires two things:

Correlation IDs in logs and traces: every log line should include the trace ID. In OpenTelemetry, the trace context is available via the current span, and can be injected into your structured logging context:

// The OpenTelemetry Java agent injects trace context into the logging MDC automatically
// (through its logger MDC instrumentation for Logback and Log4j)
// Result: every log line includes trace_id and span_id
{
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "message": "Inventory reservation attempted",
  "item_id": "sku-774",
  "quantity": 2
}

Now you can go from a metric anomaly → find a representative trace → follow the trace ID to specific log lines from every service in that trace.
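The last hop, from a trace ID to the log lines of every service in that trace, is conceptually just a filter and a sort. The sketch below shows it in plain Java over an in-memory list. In practice a log backend such as Loki or Elasticsearch runs this query for you; the LogLine class, the service names, and the messages other than the one shown above are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative lookup of log lines by trace ID (hypothetical log model, in-memory only).
final class TraceLogLookup {
    static final class LogLine {
        final long timestampMs;
        final String service, traceId, message;

        LogLine(long timestampMs, String service, String traceId, String message) {
            this.timestampMs = timestampMs;
            this.service = service;
            this.traceId = traceId;
            this.message = message;
        }

        @Override
        public String toString() {
            return timestampMs + " [" + service + "] trace_id=" + traceId + " " + message;
        }
    }

    // Every log line carrying the given trace ID, oldest first, regardless of emitting service.
    static List<LogLine> forTrace(List<LogLine> allLines, String traceId) {
        List<LogLine> matches = new ArrayList<>();
        for (LogLine line : allLines) {
            if (traceId.equals(line.traceId)) matches.add(line);
        }
        matches.sort(Comparator.comparingLong((LogLine l) -> l.timestampMs));
        return matches;
    }

    // Made-up log lines from three services, interleaving two different traces.
    static List<LogLine> sampleLines() {
        String slow = "4bf92f3577b34da6a3ce929d0e0e4736";
        String other = "0af7651916cd43dd8448eb211c80319c";
        return List.of(
            new LogLine(1000, "checkout-service", slow, "Checkout started"),
            new LogLine(1005, "checkout-service", other, "Checkout started"),
            new LogLine(1040, "inventory-service", slow, "Inventory reservation attempted"),
            new LogLine(1020, "payment-service", slow, "Payment authorized"),
            new LogLine(1050, "inventory-service", other, "Inventory reservation attempted"),
            new LogLine(1900, "inventory-service", slow, "Inventory reservation completed"));
    }

    public static void main(String[] args) {
        for (LogLine line : forTrace(sampleLines(), "4bf92f3577b34da6a3ce929d0e0e4736")) {
            System.out.println(line);
        }
    }
}
```

This only works if every service writes the same trace_id field, which is why injecting the trace context into the logging context matters.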

Exemplars in metrics: Prometheus supports exemplars, which attach a trace ID to a specific metric observation (exemplar storage has to be enabled on the Prometheus server through its exemplar-storage feature flag). When investigating a P99 latency anomaly, you can jump directly from the metric to a real trace that contributed to that metric bucket:

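// Micrometer exemplar support with OpenTelemetry
// Assumes meterRegistry is a Prometheus registry that has been wired to the tracer
// (Spring Boot configures this when tracing is enabled)
Timer.builder("checkout.duration")
    .publishPercentileHistogram() // histogram buckets give exemplars a place to attach
    .register(meterRegistry)
    .record(() -> {
        // With tracing configured, the current trace ID is attached as an exemplar
        return checkoutService.process(request);
    });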

In Grafana, clicking on a metric data point with exemplars shows you a direct link to a trace from that time window. This is the connection that makes "the metric spiked at 14:47" lead directly to "here's a trace from that spike."
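What an exemplar adds to a histogram can be shown with a small plain-Java sketch: alongside each bucket counter, keep one trace ID from a request that landed in that bucket. This is not the Prometheus or Micrometer implementation, which have their own sampling rules for exemplars. The class ExemplarHistogram, the bucket bounds, and the "keep the latest" policy are assumptions made for illustration.

```java
// Illustrative only: a histogram that remembers one exemplar per bucket.
final class ExemplarHistogram {
    static final class Exemplar {
        final String traceId;
        final double value;

        Exemplar(String traceId, double value) {
            this.traceId = traceId;
            this.value = value;
        }
    }

    private final double[] upperBounds; // bucket bounds in seconds, must be ascending
    private final long[] counts;        // one extra slot for the overflow (+Inf) bucket
    private final Exemplar[] exemplars; // latest traced observation seen in each bucket

    ExemplarHistogram(double... upperBounds) {
        this.upperBounds = upperBounds.clone();
        this.counts = new long[upperBounds.length + 1];
        this.exemplars = new Exemplar[upperBounds.length + 1];
    }

    private int bucketIndex(double seconds) {
        int i = 0;
        while (i < upperBounds.length && seconds > upperBounds[i]) i++;
        return i;
    }

    // traceId may be null when the request was not sampled for tracing.
    void record(double seconds, String traceId) {
        int i = bucketIndex(seconds);
        counts[i]++;
        if (traceId != null) exemplars[i] = new Exemplar(traceId, seconds);
    }

    long countInBucketOf(double seconds) { return counts[bucketIndex(seconds)]; }

    // Exemplar for the bucket the given latency falls into, or null if none was recorded.
    Exemplar exemplarForBucketOf(double seconds) { return exemplars[bucketIndex(seconds)]; }

    public static void main(String[] args) {
        ExemplarHistogram h = new ExemplarHistogram(0.1, 0.5, 1.0);
        h.record(0.04, "0af7651916cd43dd8448eb211c80319c");
        h.record(0.06, null); // untraced request: counted, but leaves no exemplar
        h.record(1.8, "4bf92f3577b34da6a3ce929d0e0e4736");
        Exemplar slow = h.exemplarForBucketOf(2.0);
        System.out.println("slow bucket exemplar: " + slow.traceId + " (" + slow.value + " s)");
    }
}
```

The aggregate stays cheap, since there is still one counter per bucket, but each bucket now carries a pointer back to a concrete request that can be opened as a trace.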

The OpenTelemetry stack in practice

OpenTelemetry (OTel) is the CNCF project that has become the standard for instrumentation, replacing the fragmented landscape of Jaeger clients, Zipkin clients, and proprietary SDKs. One instrumentation library, pluggable backends.

The reference architecture:

  • OpenTelemetry agent (Java, .NET, Python, Node.js) auto-instruments your services for traces and metrics with zero code changes
  • OpenTelemetry Collector receives traces, metrics, and logs; applies processing (sampling, attribute filtering); exports to backends
  • Prometheus for metrics storage and alerting
  • Grafana Tempo or Jaeger for trace storage
  • Grafana Loki or Elasticsearch for log storage
  • Grafana unified dashboard connecting all three

This stack is fully open-source, runs on Kubernetes, and costs significantly less than managed alternatives at moderate scale. At high scale (> 10TB/day of telemetry), managed offerings (Datadog, Honeycomb, Grafana Cloud) can become cost-competitive once you account for the operational complexity they take off your team.

Observability is not a tool. It's the capability to ask arbitrary questions about your system's behavior and get answers quickly. Logs, metrics, and traces — individually — give you partial answers. Connected and queryable together, they give you genuine understanding of how your distributed system behaves under real conditions.
