Distributed Tracing: How to Find Where Your Request Actually Failed
by Arif Ikhsanudin, Backend Developer
The debugging experience without tracing
A user files a support ticket: "My checkout failed at 14:32 on Tuesday." You look in Order Service logs. You find an error, but the error message is "upstream service error." You ask the Inventory team to check their logs. They find a timeout in the DB query logs. Was that the cause? You check the database slow query log. The timestamps don't quite align. You're not sure if you're looking at the same request or a different one from around the same time.
Thirty minutes later, with input from three teams and five log files, you have a theory about what happened. You're not certain.
This is the debugging experience in microservices without distributed tracing — and it's the reason distributed tracing is not optional infrastructure. It is the minimum viable observability in a system where a single user request crosses multiple service boundaries.
How distributed tracing works
The core concept: every request gets a unique trace ID when it enters the system. That trace ID is propagated through every service-to-service call, typically via HTTP headers (or message headers for queues such as Kafka). Each service records spans, which are timed operations within the service, each tagged with the trace ID, its own span ID, and the ID of its parent span. A tracing backend collects these spans and uses the parent links to assemble them into a complete trace: a timeline showing which services handled the request, in what order, and how long each step took.
The W3C Trace Context specification, a W3C Recommendation that defines the traceparent and tracestate HTTP headers, is the modern standard for trace context propagation. OpenTelemetry is the de facto standard instrumentation framework, and it uses W3C Trace Context as its default propagation format.
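To make the header format concrete, here is a minimal sketch of a version-00 traceparent value: the version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and a flags byte whose lowest bit marks the trace as sampled. The TraceParent record and its method names are illustrative assumptions, not part of OpenTelemetry or any other library, and the sketch ignores tracestate and future header versions.

```java
import java.security.SecureRandom;
import java.util.HexFormat;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative model of a version-00 W3C traceparent header (not a library API). */
record TraceParent(String traceId, String spanId, boolean sampled) {

    // "00" version, 32 hex chars of trace ID, 16 hex chars of parent span ID, 2 hex chars of flags
    private static final Pattern FORMAT =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Parses an incoming header value, rejecting anything that is not well-formed. */
    static TraceParent parse(String header) {
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("not a version-00 traceparent: " + header);
        }
        // The specification treats all-zero trace IDs and span IDs as invalid.
        if (m.group(1).equals("0".repeat(32)) || m.group(2).equals("0".repeat(16))) {
            throw new IllegalArgumentException("all-zero ID in traceparent: " + header);
        }
        int flags = Integer.parseInt(m.group(3), 16);
        return new TraceParent(m.group(1), m.group(2), (flags & 0x01) != 0);
    }

    /** Keeps the trace ID and sampled flag, but uses a fresh span ID for the outgoing call. */
    TraceParent newChild() {
        byte[] id = new byte[8];
        RANDOM.nextBytes(id);
        return new TraceParent(traceId, HexFormat.of().formatHex(id), sampled);
    }

    /** Serializes back to the header value. */
    String toHeader() {
        return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
    }

    public static void main(String[] args) {
        // Example value taken from the W3C Trace Context specification.
        TraceParent incoming = parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println("incoming: " + incoming.toHeader());
        System.out.println("outgoing: " + incoming.newChild().toHeader());
    }
}
```

The point to notice is that the trace ID never changes as the request moves between services, while each hop contributes a new span ID. That is what lets a backend stitch spans from different services into one trace.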
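// Spring Boot with OpenTelemetry auto-instrumentation
// No code changes needed: attach the agent at startup and configure it
// through environment variables in the Dockerfile or deployment.
// JAVA_TOOL_OPTIONS is read by the JVM itself; JAVA_OPTS only works if your start script passes it on.
// JAVA_TOOL_OPTIONS="-javaagent:/otel-javaagent.jar"
// OTEL_SERVICE_NAME=order-service
// OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
// Port 4317 is the OTLP gRPC port; recent agent versions default to http/protobuf (port 4318).
// OTEL_EXPORTER_OTLP_PROTOCOL=grpc
// OTEL_TRACES_EXPORTER=otlp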
The OpenTelemetry Java agent auto-instruments Spring Boot, JDBC, Kafka clients, and HTTP clients — no manual span creation required for common operations. The agent injects and propagates trace context automatically.
What a trace shows you
In Jaeger or Grafana Tempo (two common backends for OpenTelemetry traces), a trace looks like a Gantt chart: horizontal bars representing spans, nested to show parent-child relationships, with timestamps and duration.
[Order Service] POST /orders                          0ms - 245ms
  [Order Service] validate request                    0ms - 5ms
  [Order Service] HTTP GET /users/{id}                5ms - 45ms
    [User Service] GET /users/123                     5ms - 45ms
      [User Service] DB SELECT users                  8ms - 42ms   ← 34ms query
  [Order Service] HTTP POST /inventory/reserve        45ms - 240ms
    [Inventory Service] POST /reserve                 45ms - 240ms
      [Inventory Service] DB UPDATE inventory         48ms - 238ms ← 190ms, lock wait
From this trace, you can see immediately that the UPDATE in Inventory Service took 190ms, most of it spent waiting on a lock, and that this one span dominates the 245ms total request time. Without the trace, you'd be looking at Order Service logs showing a 245ms request with no internal detail.
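The same reading can be done mechanically. The sketch below loads the spans from the example trace, links each span to its parent, and computes each span's self time (its duration minus the time covered by its direct children). The Span record, the short IDs "a" through "h", and the method names are illustrative assumptions, and the calculation assumes sibling spans do not overlap, which holds for this trace.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SelfTimeDemo {

    /** A finished span; parentId is null for the root. Field names are illustrative. */
    record Span(String id, String parentId, String name, long startMs, long endMs) {
        long duration() { return endMs - startMs; }
    }

    /** Self time = own duration minus the summed duration of direct children. */
    static Map<String, Long> selfTimes(List<Span> spans) {
        Map<String, Long> childTime = new HashMap<>();
        for (Span s : spans) {
            if (s.parentId() != null) {
                childTime.merge(s.parentId(), s.duration(), Long::sum);
            }
        }
        Map<String, Long> self = new HashMap<>();
        for (Span s : spans) {
            self.put(s.id(), s.duration() - childTime.getOrDefault(s.id(), 0L));
        }
        return self;
    }

    /** The span where the request spent the most time doing its own work. */
    static Span slowest(List<Span> spans) {
        Map<String, Long> self = selfTimes(spans);
        return spans.stream()
                .max(Comparator.comparingLong(s -> self.get(s.id())))
                .orElseThrow();
    }

    /** The eight spans from the example trace, with made-up span IDs. */
    static List<Span> exampleTrace() {
        return List.of(
                new Span("a", null, "order-service POST /orders", 0, 245),
                new Span("b", "a", "order-service validate request", 0, 5),
                new Span("c", "a", "order-service HTTP GET /users/{id}", 5, 45),
                new Span("d", "c", "user-service GET /users/123", 5, 45),
                new Span("e", "d", "user-service DB SELECT users", 8, 42),
                new Span("f", "a", "order-service HTTP POST /inventory/reserve", 45, 240),
                new Span("g", "f", "inventory-service POST /reserve", 45, 240),
                new Span("h", "g", "inventory-service DB UPDATE inventory", 48, 238));
    }

    public static void main(String[] args) {
        List<Span> trace = exampleTrace();
        Span s = slowest(trace);
        // prints "inventory-service DB UPDATE inventory self time: 190ms"
        System.out.println(s.name() + " self time: " + selfTimes(trace).get(s.id()) + "ms");
    }
}
```

Self time is useful because a parent span's duration always includes its children: the root span is 245ms long, but only 5ms of that is Order Service's own work.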
Sampling strategy
Recording every span for every request in a high-traffic system is expensive. Sampling is the practice of recording only a fraction of traces.
Head-based sampling (decide at trace start): sample a fixed share of all requests, for example 10%. Simple, but failures, which are the requests you most want to trace, are sampled at the same rate as successful requests and may not be captured.
Tail-based sampling (decide after request completes): sample 100% of traces with errors or high latency, and 1% of everything else. This captures exactly the cases you care about most. It requires a collector that can buffer all spans of a trace and apply the sampling decision after the fact (for example, the OpenTelemetry Collector with its tail_sampling processor, or Grafana Alloy in front of Grafana Tempo).
# OpenTelemetry Collector: tail-based sampling config
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces-policy
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
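The decision these three policies describe can be sketched in a few lines: once the wait window for a trace has elapsed, keep the trace if any span failed, if the end-to-end duration exceeded the threshold, or if it wins the 1% baseline draw. This is a simplified illustration of the idea, not the Collector's actual implementation. The Span record, the keep method, and the caller-supplied random value are all assumptions made so the logic is easy to follow and test.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

final class TailSamplingDemo {

    /** Hypothetical span summary; a real collector buffers full span data. */
    record Span(String traceId, long startMs, long endMs, boolean error) {}

    static final long LATENCY_THRESHOLD_MS = 1000; // same value as threshold_ms in the config
    static final double BASELINE_PERCENT = 1.0;    // same value as sampling_percentage

    /**
     * Decides whether to keep a trace after all of its buffered spans are in.
     * randomPercent is a value in [0, 100) supplied by the caller.
     */
    static boolean keep(List<Span> bufferedSpans, double randomPercent) {
        // Policy 1: any span with an error status keeps the whole trace.
        if (bufferedSpans.stream().anyMatch(Span::error)) {
            return true;
        }
        // Policy 2: keep traces whose end-to-end duration exceeds the threshold.
        long start = bufferedSpans.stream().mapToLong(Span::startMs).min().orElse(0);
        long end = bufferedSpans.stream().mapToLong(Span::endMs).max().orElse(0);
        if (end - start > LATENCY_THRESHOLD_MS) {
            return true;
        }
        // Policy 3: keep a small random baseline of everything else.
        return randomPercent < BASELINE_PERCENT;
    }

    public static void main(String[] args) {
        List<Span> failed = List.of(new Span("t1", 0, 120, false), new Span("t1", 10, 90, true));
        List<Span> slow = List.of(new Span("t2", 0, 1500, false));
        List<Span> normal = List.of(new Span("t3", 0, 245, false));

        double roll = ThreadLocalRandom.current().nextDouble(100.0);
        System.out.println("failed trace kept: " + keep(failed, roll));
        System.out.println("slow trace kept:   " + keep(slow, roll));
        System.out.println("normal trace kept: " + keep(normal, roll));
    }
}
```

The cost of this approach is visible in the signature: the function needs every span of the trace before it can answer, which is why the collector has to hold spans in memory for the length of decision_wait.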
For most teams: start with 10% head-based sampling. Move to tail-based sampling once your tracing infrastructure is stable and you understand the data volume.
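A 10% head-based decision is usually derived from the trace ID itself rather than from a fresh random number, so that every service seeing the same trace ID reaches the same answer without coordination. The sketch below shows that idea in simplified form. The class and method names are illustrative, and the arithmetic is an assumption chosen for clarity, not a copy of any SDK's sampler.

```java
final class HeadSamplingDemo {

    /**
     * Deterministic sampling decision based on the low 64 bits of a
     * 32-hex-character trace ID. ratio is the share of traces to keep (0.0 to 1.0).
     */
    static boolean shouldSample(String traceIdHex, double ratio) {
        if (ratio >= 1.0) {
            return true;
        }
        if (ratio <= 0.0) {
            return false;
        }
        // Last 16 hex chars = low 64 bits; clear the sign bit to get a non-negative value.
        long low = Long.parseUnsignedLong(traceIdHex.substring(16), 16) & Long.MAX_VALUE;
        long threshold = (long) (ratio * Long.MAX_VALUE);
        return low < threshold;
    }

    public static void main(String[] args) {
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736";
        // The same trace ID always yields the same decision for a given ratio.
        System.out.println("sampled at 10%: " + shouldSample(traceId, 0.10));
        System.out.println("sampled at 50%: " + shouldSample(traceId, 0.50));
    }
}
```

This only works as a fair 10% if trace IDs are effectively random, which is the case for IDs generated by standard tracing SDKs.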
Adding custom spans and attributes
Auto-instrumentation covers framework-level operations. For business-logic-level visibility — "why did the inventory reservation fail for this specific item?" — add custom spans:
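import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

// The tracer name identifies the code that creates the spans
Tracer tracer = GlobalOpenTelemetry.getTracer("inventory-service");
Span span = tracer.spanBuilder("inventory.reserve")
    .setAttribute("item.id", itemId)
    .setAttribute("requested.quantity", quantity)
    .startSpan();
// makeCurrent() makes this span the parent of any spans created inside the try block
try (Scope scope = span.makeCurrent()) {
    int reserved = inventoryRepository.reserve(itemId, quantity);
    span.setAttribute("reserved.quantity", reserved);
    span.setAttribute("reservation.success", reserved >= quantity);
} catch (Exception e) {
    // Attach the exception to the span and mark the span as failed
    span.recordException(e);
    span.setStatus(StatusCode.ERROR);
    throw e;
} finally {
    // Always end the span; a span that is never ended is never exported
    span.end();
}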
Custom attributes let you search traces by business attributes, such as "show me all traces where item.id = sku-789 and reservation.success = false", which transforms debugging from timestamp archaeology into a targeted query.
Start with tracing on your most critical user paths. Once the infrastructure is in place and teams are familiar with reading traces, expand coverage. The infrastructure investment is front-loaded; the debugging time savings accrue continuously.