Distributed Tracing: How to Find Where Your Request Actually Failed

February 4, 2026

by Arif Ikhsanudin, Backend Developer

The debugging experience without tracing

A user files a support ticket: "My checkout failed at 14:32 on Tuesday." You look in Order Service logs. You find an error, but the error message is "upstream service error." You ask the Inventory team to check their logs. They find a timeout in the DB query logs. Was that the cause? You check the database slow query log. The timestamps don't quite align. You're not sure if you're looking at the same request or a different one from around the same time.

Thirty minutes later, with input from three teams and five log files, you have a theory about what happened. You're not certain.

This is the debugging experience in microservices without distributed tracing — and it's the reason distributed tracing is not optional infrastructure. It is the minimum viable observability in a system where a single user request crosses multiple service boundaries.

How distributed tracing works

The core concept: every request gets a unique trace ID when it enters the system. That trace ID is propagated through every service-to-service call via HTTP headers. Each service records spans — timed operations within the service — tagged with the trace ID. A tracing backend collects these spans and assembles them into a complete trace: a timeline showing which services handled the request, in what order, and how long each step took.

The W3C Trace Context specification, a W3C Recommendation that defines the traceparent and tracestate HTTP headers, is the modern standard for propagating trace context between services. OpenTelemetry is the vendor-neutral instrumentation standard whose APIs, SDKs, and agents implement it.
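To make the propagation step concrete, here is a minimal sketch of reading a version-00 traceparent header and building the header for the next hop. TraceParentSketch and its method names are hypothetical and exist only for illustration; in practice the OpenTelemetry agent or SDK does this for you. The sample header value is the one used in the W3C specification.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of OpenTelemetry.
final class TraceParentSketch {
    // Version 00 layout: 00-<32 hex trace ID>-<16 hex parent span ID>-<2 hex flags>
    private static final Pattern FORMAT =
        Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    final String traceId;
    final String parentSpanId;
    final boolean sampled;

    private TraceParentSketch(String traceId, String parentSpanId, boolean sampled) {
        this.traceId = traceId;
        this.parentSpanId = parentSpanId;
        this.sampled = sampled;
    }

    /** Parses a traceparent header value; returns null if it is malformed. */
    static TraceParentSketch parse(String header) {
        if (header == null) {
            return null;
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            return null;
        }
        String traceId = m.group(1);
        String spanId = m.group(2);
        // All-zero IDs are invalid according to the specification.
        if (traceId.matches("0+") || spanId.matches("0+")) {
            return null;
        }
        int flags = Integer.parseInt(m.group(3), 16);
        // Bit 0 of the flags byte is the "sampled" flag.
        return new TraceParentSketch(traceId, spanId, (flags & 0x01) != 0);
    }

    /** Header for an outgoing call: same trace ID, the caller's own span ID as parent. */
    String forOutgoingCall(String currentSpanId) {
        // Simplification: only the sampled flag is carried forward.
        return "00-" + traceId + "-" + currentSpanId + "-" + (sampled ? "01" : "00");
    }

    public static void main(String[] args) {
        TraceParentSketch incoming =
            parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println("trace ID: " + incoming.traceId);
        System.out.println("outgoing: " + incoming.forOutgoingCall("b7ad6b7169203331"));
    }
}
```

The trace ID stays constant across every hop while the span ID changes, which is what lets the backend stitch spans from different services into one trace.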

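// Spring Boot with the OpenTelemetry Java agent (auto-instrumentation)
// No code changes needed: attach the agent at startup and configure it via environment variables

// In the Dockerfile or deployment manifest:
// JAVA_TOOL_OPTIONS="-javaagent:/otel-javaagent.jar"   (read by the JVM itself; JAVA_OPTS only works if your start script passes it on)
// OTEL_SERVICE_NAME=order-service
// OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
// OTEL_EXPORTER_OTLP_PROTOCOL=grpc   (4317 is the OTLP/gRPC port; agent 2.x defaults to http/protobuf, which uses 4318)
// OTEL_TRACES_EXPORTER=otlp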

The OpenTelemetry Java agent auto-instruments Spring Boot, JDBC, Kafka clients, and HTTP clients — no manual span creation required for common operations. The agent injects and propagates trace context automatically.

What a trace shows you

In Jaeger or Grafana Tempo (two common backends for OpenTelemetry traces), a trace looks like a Gantt chart: horizontal bars representing spans, nested to show parent-child relationships, with timestamps and duration.

[Order Service] POST /orders                    0ms - 245ms
  [Order Service] validate request              0ms - 5ms
  [Order Service] HTTP GET /users/{id}          5ms - 45ms
    [User Service] GET /users/123               5ms - 45ms
      [User Service] DB SELECT users            8ms - 42ms  ← 34ms query
  [Order Service] HTTP POST /inventory/reserve  45ms - 240ms
    [Inventory Service] POST /reserve           45ms - 240ms
      [Inventory Service] DB UPDATE inventory   48ms - 238ms ← 190ms, lock wait

From this trace, you can see immediately that the inventory UPDATE took 190ms, most of it spent waiting on a lock, and that this single query dominates the 245ms total. Without the trace, you'd be looking at Order Service logs showing a 245ms request with no internal detail.
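The same reading can be done mechanically. The sketch below models the example trace and computes each span's self time, meaning its own duration minus the time spent in its direct children. SelfTimeSketch and its Span class are hypothetical, and the calculation assumes sibling spans do not overlap, which holds for this trace but not for requests that fan out in parallel.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the example trace; not a tracing backend API.
final class SelfTimeSketch {
    static final class Span {
        final String name;
        final int parent;   // index of the parent span, -1 for the root
        final long startMs;
        final long endMs;

        Span(String name, int parent, long startMs, long endMs) {
            this.name = name;
            this.parent = parent;
            this.startMs = startMs;
            this.endMs = endMs;
        }

        long duration() {
            return endMs - startMs;
        }
    }

    /** Self time = own duration minus direct children (assumes children do not overlap). */
    static long[] selfTimes(List<Span> spans) {
        long[] self = new long[spans.size()];
        for (int i = 0; i < spans.size(); i++) {
            self[i] = spans.get(i).duration();
        }
        for (Span s : spans) {
            if (s.parent >= 0) {
                self[s.parent] -= s.duration();
            }
        }
        return self;
    }

    /** Index of the span with the largest self time. */
    static int slowest(long[] self) {
        int best = 0;
        for (int i = 1; i < self.length; i++) {
            if (self[i] > self[best]) {
                best = i;
            }
        }
        return best;
    }

    /** The spans from the example trace above, in display order. */
    static List<Span> exampleTrace() {
        return Arrays.asList(
            new Span("POST /orders", -1, 0, 245),
            new Span("validate request", 0, 0, 5),
            new Span("HTTP GET /users/{id}", 0, 5, 45),
            new Span("GET /users/123", 2, 5, 45),
            new Span("DB SELECT users", 3, 8, 42),
            new Span("HTTP POST /inventory/reserve", 0, 45, 240),
            new Span("POST /reserve", 5, 45, 240),
            new Span("DB UPDATE inventory", 6, 48, 238));
    }

    public static void main(String[] args) {
        List<Span> trace = exampleTrace();
        long[] self = selfTimes(trace);
        int i = slowest(self);
        System.out.println(trace.get(i).name + ": " + self[i] + " ms of "
            + trace.get(0).duration() + " ms total");
    }
}
```

For this trace the self times add up to the 245ms total, and the inventory UPDATE accounts for 190ms of it, roughly 78%.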

Sampling strategy

Recording every span for every request in a high-traffic system is expensive. Sampling is the practice of recording only a fraction of traces.

Head-based sampling (decide at trace start): sample a fixed fraction of all requests, say 10%. It is simple, but failures, which you most want to trace, are sampled at the same rate as successful requests. At 10%, roughly nine out of ten failed requests leave no trace at all.
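One common way to implement the head-based decision is to derive it from the trace ID itself, so that every service reaches the same verdict for the same trace without coordination. The sketch below illustrates that idea; HeadSamplerSketch is a hypothetical class, similar in spirit to a trace-ID-ratio sampler but not the exact algorithm of any particular SDK.

```java
// Hypothetical sampler for illustration; not an OpenTelemetry class.
final class HeadSamplerSketch {
    private final double ratio;
    private final long threshold;

    HeadSamplerSketch(double ratio) {
        if (ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("ratio must be between 0 and 1");
        }
        this.ratio = ratio;
        // Map the ratio onto the range of non-negative long values.
        this.threshold = (long) (ratio * Long.MAX_VALUE);
    }

    /** Decides from the 32-hex-character trace ID alone, so the result is repeatable. */
    boolean shouldSample(String traceIdHex) {
        if (ratio >= 1.0) {
            return true;
        }
        // Treat the low 64 bits of the trace ID as a pseudo-random number.
        long low = Long.parseUnsignedLong(traceIdHex.substring(16), 16);
        // Clear the sign bit so the value is non-negative, then compare.
        return (low & Long.MAX_VALUE) < threshold;
    }

    public static void main(String[] args) {
        HeadSamplerSketch tenPercent = new HeadSamplerSketch(0.10);
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736";
        System.out.println("sampled: " + tenPercent.shouldSample(traceId));
    }
}
```

Note that nothing in this decision knows whether the request will later fail, which is exactly the limitation described above.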

Tail-based sampling (decide after the request completes): sample 100% of traces with errors or high latency, and 1% of everything else. This captures the cases you care about most. It requires a collector that can buffer spans and apply the sampling decision after the fact, for example the OpenTelemetry Collector with the tail_sampling processor, or Grafana Alloy, which wraps the same processor. All spans of a trace must reach the same collector instance for the decision to see the whole trace.

# OpenTelemetry Collector: tail-based sampling config
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
    - name: errors-policy
      type: status_code
      status_code: {status_codes: [ERROR]}
    - name: slow-traces-policy
      type: latency
      latency: {threshold_ms: 1000}
    - name: probabilistic-policy
      type: probabilistic
      probabilistic: {sampling_percentage: 1}
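
The logic those three policies express can be sketched in a few lines. This is an illustration of the decision only, not the collector's implementation: TailSamplingSketch and FinishedSpan are hypothetical names, the strict greater-than comparison on latency is an assumption, and the random value is passed in by the caller so the behaviour is easy to test.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a tail-based decision; not collector code.
final class TailSamplingSketch {
    static final long LATENCY_THRESHOLD_MS = 1000;  // mirrors threshold_ms above
    static final double BASELINE_RATIO = 0.01;      // mirrors sampling_percentage: 1

    static final class FinishedSpan {
        final long startMs;
        final long endMs;
        final boolean error;

        FinishedSpan(long startMs, long endMs, boolean error) {
            this.startMs = startMs;
            this.endMs = endMs;
            this.error = error;
        }
    }

    /**
     * Decides whether to keep a fully buffered trace.
     * random is a value in [0, 1), e.g. ThreadLocalRandom.current().nextDouble().
     */
    static boolean keep(List<FinishedSpan> trace, double random) {
        long first = Long.MAX_VALUE;
        long last = Long.MIN_VALUE;
        for (FinishedSpan span : trace) {
            if (span.error) {
                return true;  // errors policy: always keep
            }
            first = Math.min(first, span.startMs);
            last = Math.max(last, span.endMs);
        }
        if (!trace.isEmpty() && last - first > LATENCY_THRESHOLD_MS) {
            return true;      // latency policy: always keep slow traces
        }
        return random < BASELINE_RATIO;  // probabilistic policy for the rest
    }

    public static void main(String[] args) {
        List<FinishedSpan> slow = Arrays.asList(
            new FinishedSpan(0, 1500, false),
            new FinishedSpan(10, 1490, false));
        System.out.println("keep slow trace: " + keep(slow, 0.5));
    }
}
```

The decision can only be made once every span has arrived, which is why decision_wait exists and why tail-based sampling costs more memory than head-based sampling.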

For most teams: start with 10% head-based sampling. Move to tail-based sampling once your tracing infrastructure is stable and you understand the data volume.

Adding custom spans and attributes

Auto-instrumentation covers framework-level operations. For business-logic-level visibility — "why did the inventory reservation fail for this specific item?" — add custom spans:

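import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

// The name identifies the instrumentation scope; create the tracer once and reuse it
Tracer tracer = GlobalOpenTelemetry.getTracer("inventory-service");

// Start a span and attach the business inputs as searchable attributes
Span span = tracer.spanBuilder("inventory.reserve")
    .setAttribute("item.id", itemId)
    .setAttribute("requested.quantity", quantity)
    .startSpan();

// makeCurrent() makes this span the parent of any spans created inside the block
try (Scope scope = span.makeCurrent()) {
    int reserved = inventoryRepository.reserve(itemId, quantity);
    span.setAttribute("reserved.quantity", reserved);
    span.setAttribute("reservation.success", reserved >= quantity);
} catch (Exception e) {
    // Record the stack trace on the span and mark it as failed
    span.recordException(e);
    span.setStatus(StatusCode.ERROR);
    throw e;
} finally {
    // Always end the span, otherwise it is never exported
    span.end();
}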

Custom attributes let you search traces by business attributes, for example "show me all traces where item.id = sku-789 and reservation.success = false". That transforms debugging from timestamp archaeology into a targeted query.

Start with tracing on your most critical user paths. Once the infrastructure is in place and teams are familiar with reading traces, expand coverage. The infrastructure investment is front-loaded; the debugging time savings accrue continuously.
