Distributed Tracing: How to Find Where Your Request Actually Failed

by Arif Ikhsanudin, Backend Developer

The debugging experience without tracing

A user files a support ticket: "My checkout failed at 14:32 on Tuesday." You look in Order Service logs. You find an error, but the error message is "upstream service error." You ask the Inventory team to check their logs. They find a timeout in the DB query logs. Was that the cause? You check the database slow query log. The timestamps don't quite align. You're not sure if you're looking at the same request or a different one from around the same time.

Thirty minutes later, with input from three teams and five log files, you have a theory about what happened. You're not certain.

This is the debugging experience in microservices without distributed tracing — and it's the reason distributed tracing is not optional infrastructure. It is the minimum viable observability in a system where a single user request crosses multiple service boundaries.

How distributed tracing works

The core concept: every request gets a unique trace ID when it enters the system. That trace ID is propagated through every service-to-service call via HTTP headers. Each service records spans — timed operations within the service — tagged with the trace ID. A tracing backend collects these spans and assembles them into a complete trace: a timeline showing which services handled the request, in what order, and how long each step took.

The W3C Trace Context specification, which defines the traceparent and tracestate HTTP headers, is the modern standard for propagating trace context between services. OpenTelemetry is the standard instrumentation framework that implements it.
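To make the header format concrete, here is a minimal sketch of reading and re-emitting a version-00 traceparent value using only the JDK. The TraceParent type and its method names are hypothetical, chosen for illustration; in practice the OpenTelemetry agent does this for you. The sample header value is the one used in the W3C specification.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical value type for a version-00 traceparent header (illustration only). */
record TraceParent(String traceId, String spanId, boolean sampled) {

    // Layout: 00-<32 hex chars trace ID>-<16 hex chars span ID>-<2 hex chars flags>
    private static final Pattern FORMAT =
        Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    static TraceParent parse(String header) {
        Matcher m = FORMAT.matcher(header.trim());
        // All-zero trace IDs and span IDs are invalid per the specification.
        if (!m.matches() || m.group(1).matches("0+") || m.group(2).matches("0+")) {
            throw new IllegalArgumentException("Invalid traceparent: " + header);
        }
        // The lowest bit of the flags byte is the "sampled" flag.
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) != 0;
        return new TraceParent(m.group(1), m.group(2), sampled);
    }

    /** Value to send downstream: same trace ID, but the calling span's own ID. */
    TraceParent withSpanId(String callerSpanId) {
        return new TraceParent(traceId, callerSpanId, sampled);
    }

    /** Random 8-byte span ID rendered as 16 lowercase hex characters. */
    static String newSpanId() {
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }

    String toHeader() {
        return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
    }

    public static void main(String[] args) {
        TraceParent incoming =
            TraceParent.parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        // The trace ID stays constant across services; only the span ID changes per hop.
        TraceParent outgoing = incoming.withSpanId(newSpanId());
        System.out.println(outgoing.toHeader());
    }
}
```

The point to notice is that the trace ID survives every hop unchanged while each service substitutes its own span ID, which is what lets the backend stitch spans from different services into one trace.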

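// Spring Boot with OpenTelemetry auto-instrumentation
// No code changes needed: attach the Java agent at startup and configure it via environment variables

// In Dockerfile or deployment:
// JAVA_TOOL_OPTIONS="-javaagent:/otel-javaagent.jar"   (read by the JVM itself; JAVA_OPTS only works if your start script passes it to java)
// OTEL_SERVICE_NAME=order-service
// OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
// OTEL_EXPORTER_OTLP_PROTOCOL=grpc   (4317 is the OTLP gRPC port; agent 2.x defaults to http/protobuf, which uses 4318)
// OTEL_TRACES_EXPORTER=otlp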

The OpenTelemetry Java agent auto-instruments Spring Boot, JDBC, Kafka clients, and HTTP clients — no manual span creation required for common operations. The agent injects and propagates trace context automatically.

What a trace shows you

In Jaeger or Grafana Tempo (two common backends for OpenTelemetry traces), a trace looks like a Gantt chart: horizontal bars representing spans, nested to show parent-child relationships, with timestamps and duration.

[Order Service] POST /orders                    0ms - 245ms
  [Order Service] validate request              0ms - 5ms
  [Order Service] HTTP GET /users/{id}          5ms - 45ms
    [User Service] GET /users/123               5ms - 45ms
      [User Service] DB SELECT users            8ms - 42ms  ← 34ms query
  [Order Service] HTTP POST /inventory/reserve  45ms - 240ms
    [Inventory Service] POST /reserve           45ms - 240ms
      [Inventory Service] DB UPDATE inventory   48ms - 238ms ← 190ms, lock wait

From this trace, you can see immediately that the UPDATE query in Inventory Service took 190ms, most of it spent waiting on a lock, and that this single span dominates the total 245ms request time. Without the trace, you'd be looking at Order Service logs showing a 245ms request time with no internal detail.
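The same reading of the Gantt chart can be done mechanically by computing each span's self time: its own duration minus the duration of its direct children. The sketch below encodes the example trace above using hypothetical TraceSpan and TraceSelfTime types and made-up span IDs ("a" to "h"). It assumes the trace is complete and that sibling spans do not overlap, which holds for this example but not for traces with parallel calls.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TraceSelfTime {

    /** Self time = span duration minus the durations of its direct children. */
    static Map<String, Long> selfTimes(List<TraceSpan> spans) {
        Map<String, Long> self = new HashMap<>();
        for (TraceSpan s : spans) {
            self.put(s.id(), s.durationMs());
        }
        for (TraceSpan s : spans) {
            if (s.parentId() != null) {
                // Subtract the child's duration from its parent, if the parent was recorded.
                self.computeIfPresent(s.parentId(), (id, remaining) -> remaining - s.durationMs());
            }
        }
        return self;
    }

    /** The span with the largest self time is where the request actually spent its time. */
    static TraceSpan dominant(List<TraceSpan> spans) {
        Map<String, Long> self = selfTimes(spans);
        return spans.stream()
            .max(Comparator.comparingLong(s -> self.get(s.id())))
            .orElseThrow();
    }

    /** The example trace from the article; IDs are invented for illustration. */
    static List<TraceSpan> exampleTrace() {
        return List.of(
            new TraceSpan("a", null, "Order Service: POST /orders", 0, 245),
            new TraceSpan("b", "a", "Order Service: validate request", 0, 5),
            new TraceSpan("c", "a", "Order Service: HTTP GET /users/{id}", 5, 45),
            new TraceSpan("d", "c", "User Service: GET /users/123", 5, 45),
            new TraceSpan("e", "d", "User Service: DB SELECT users", 8, 42),
            new TraceSpan("f", "a", "Order Service: HTTP POST /inventory/reserve", 45, 240),
            new TraceSpan("g", "f", "Inventory Service: POST /reserve", 45, 240),
            new TraceSpan("h", "g", "Inventory Service: DB UPDATE inventory", 48, 238));
    }

    public static void main(String[] args) {
        List<TraceSpan> trace = exampleTrace();
        TraceSpan worst = dominant(trace);
        // prints "Inventory Service: DB UPDATE inventory -> 190ms self time"
        System.out.println(worst.name() + " -> " + selfTimes(trace).get(worst.id()) + "ms self time");
    }
}

/** One span; parentId is null for the root. Times are millisecond offsets from trace start. */
record TraceSpan(String id, String parentId, String name, long startMs, long endMs) {
    long durationMs() {
        return endMs - startMs;
    }
}
```

Self time is what separates a span that is slow from a span that is merely waiting on a slow child: the two HTTP client spans in Order Service have a self time of 0ms even though they are 40ms and 195ms long.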

Sampling strategy

Recording every span for every request in a high-traffic system is expensive. Sampling is the practice of recording only a fraction of traces.

Head-based sampling (decide at trace start): sample 10% of all requests. This is simple, but it means failures, which are the requests you most want to trace, are sampled at the same rate as successful requests and may not be captured.
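As a rough sketch of how a head-based decision can be made, the class below derives the keep/drop choice from the trace ID alone, so every service that sees the same trace ID reaches the same answer without coordination. HeadSampler is a hypothetical name and this is not OpenTelemetry's implementation; it assumes trace IDs are 32 hex characters whose low 64 bits are uniformly random.

```java
/** Head-based sampling sketch: the decision depends only on the trace ID. */
class HeadSampler {

    private final double ratio;

    HeadSampler(double ratio) {
        if (ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("ratio must be between 0 and 1");
        }
        this.ratio = ratio;
    }

    boolean shouldSample(String traceId) {
        if (ratio >= 1.0) {
            return true;
        }
        if (ratio <= 0.0) {
            return false;
        }
        // Treat the low 64 bits of the 128-bit trace ID as a uniformly distributed number.
        long low = Long.parseUnsignedLong(traceId.substring(16), 16);
        // Drop one bit so the value is non-negative and comparable with a signed threshold.
        long bucket = low >>> 1;
        return bucket < (long) (ratio * Long.MAX_VALUE);
    }

    public static void main(String[] args) {
        HeadSampler tenPercent = new HeadSampler(0.10);
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736";
        // The same trace ID always yields the same decision, in every service.
        System.out.println(tenPercent.shouldSample(traceId) == tenPercent.shouldSample(traceId));
    }
}
```

Note that nothing in the decision knows whether the request will later fail or be slow, which is exactly the limitation described above.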

Tail-based sampling (decide after request completes): sample 100% of traces with errors or high latency, and 1% of everything else. This captures exactly the cases you care about most. It requires a collector that can buffer all spans of a trace and apply the sampling decision after the fact, such as the OpenTelemetry Collector with its tail_sampling processor (Grafana Alloy ships the same processor; Grafana Tempo itself stores traces but does not make tail-sampling decisions).

# OpenTelemetry Collector: tail-based sampling config
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
    - name: errors-policy
      type: status_code
      status_code: {status_codes: [ERROR]}
    - name: slow-traces-policy
      type: latency
      latency: {threshold_ms: 1000}
    - name: probabilistic-policy
      type: probabilistic
      probabilistic: {sampling_percentage: 1}
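The logic these three policies express can be sketched in a few lines: once all spans of a trace have been buffered, keep the trace if any policy matches. The code below is an illustration of that idea, not the collector's implementation; TailSamplingDecision and FinishedSpan are hypothetical names, and the random draw is passed in as a parameter so the decision is easy to test.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

class TailSamplingDecision {

    static final long LATENCY_THRESHOLD_MS = 1000;   // mirrors threshold_ms: 1000
    static final double BASELINE_PERCENTAGE = 1.0;   // mirrors sampling_percentage: 1

    /**
     * Decide whether to keep a fully buffered trace.
     * randomPercent is a uniform draw in [0, 100) used for the baseline policy.
     */
    static boolean keep(List<FinishedSpan> trace, double randomPercent) {
        boolean hasError = trace.stream().anyMatch(FinishedSpan::error);

        // Trace duration: earliest span start to latest span end.
        long start = trace.stream().mapToLong(FinishedSpan::startMs).min().orElse(0);
        long end = trace.stream().mapToLong(FinishedSpan::endMs).max().orElse(0);
        boolean slow = (end - start) > LATENCY_THRESHOLD_MS;

        boolean baseline = randomPercent < BASELINE_PERCENTAGE;

        // A trace is kept if any one of the policies matches.
        return hasError || slow || baseline;
    }

    public static void main(String[] args) {
        List<FinishedSpan> failedCheckout = List.of(
            new FinishedSpan(0, 245, false),
            new FinishedSpan(48, 238, true));
        double draw = ThreadLocalRandom.current().nextDouble(100.0);
        // Always true: the error policy matches regardless of the random draw.
        System.out.println(keep(failedCheckout, draw));
    }
}

/** Minimal view of a completed span: timing in ms plus whether it ended with an error status. */
record FinishedSpan(long startMs, long endMs, boolean error) {}
```

The need to see the whole trace before deciding is also why decision_wait exists in the config: the collector holds spans for that long before evaluating the policies.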

For most teams: start with 10% head-based sampling. Move to tail-based sampling once your tracing infrastructure is stable and you understand the data volume.

Adding custom spans and attributes

Auto-instrumentation covers framework-level operations. For business-logic-level visibility — "why did the inventory reservation fail for this specific item?" — add custom spans:

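import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

// Tracer named after the instrumented service
Tracer tracer = GlobalOpenTelemetry.getTracer("inventory-service");

// Start a span that carries the business inputs as attributes
Span span = tracer.spanBuilder("inventory.reserve")
    .setAttribute("item.id", itemId)
    .setAttribute("requested.quantity", quantity)
    .startSpan();

// makeCurrent() makes this span the parent of any spans created inside the block
try (Scope scope = span.makeCurrent()) {
    int reserved = inventoryRepository.reserve(itemId, quantity);
    span.setAttribute("reserved.quantity", reserved);
    span.setAttribute("reservation.success", reserved >= quantity);
} catch (Exception e) {
    // Attach the exception to the span and mark the span as failed
    span.recordException(e);
    span.setStatus(StatusCode.ERROR);
    throw e;
} finally {
    // Always end the span; a span that is never ended is never exported
    span.end();
}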

Custom attributes let you search traces by business attributes ("show me all traces where item.id = sku-789 and reservation.success = false"), which turns debugging from timestamp archaeology into a targeted query.

Start with tracing on your most critical user paths. Once the infrastructure is in place and teams are familiar with reading traces, expand coverage. The infrastructure investment is front-loaded; the debugging time savings accrue continuously.
