Logging Across Microservices Is Useless If You Can't Connect the Dots
by Eric Hanson, Backend Developer at Clean Systems Consulting
Why individually good logs fail you at the system level
Each of your services logs well. Order Service uses structured JSON, logs request IDs, logs errors with stack traces. Inventory Service does the same. Payment Service too. Then a checkout fails at 14:47:23, and you search Order Service logs for that timestamp. You find an error. You want to know what Inventory Service was doing when Order Service got that error. You search Inventory Service logs for 14:47:23. You find several requests. You can't tell which one corresponds to the Order Service error you're investigating.
The logs are individually correct and collectively useless for cross-service debugging. The missing ingredient is correlation: a shared ID that travels with a request through every service it touches, so you can retrieve the complete story of that request from any log aggregation system.
Structured logging as the baseline
Before correlation IDs matter, you need structured logs. Logs emitted as plain text strings are not queryable in a useful way. You can grep for error messages, but you can't filter by user_id and status_code simultaneously, or aggregate error rates by endpoint.
Structured logging means emitting JSON (or another structured format) so log aggregation systems can index and query individual fields:
{
  "timestamp": "2026-04-25T14:47:23.412Z",
  "level": "ERROR",
  "service": "order-service",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "user_id": "user-8821",
  "order_id": "order-99142",
  "message": "Inventory reservation failed",
  "error": "InventoryServiceException: item sku-774 out of stock",
  "duration_ms": 187
}
In Java, Logback with logstash-logback-encoder or Log4j2 with JsonTemplateLayout produces this format with minimal configuration. In Go, zap or zerolog emit structured JSON by default.
The fields that matter: timestamp (ISO 8601, always UTC), level, service, trace_id, message, and any domain-specific IDs relevant to the operation (user_id, order_id, payment_id).
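For Logback, a minimal configuration along these lines produces that shape, assuming the logstash-logback-encoder dependency is on the classpath (verify encoder options against the version you use):

```xml
<!-- logback-spring.xml: emit each log event as one JSON object per line.
     LogstashEncoder includes MDC entries (e.g. trace_id) as top-level fields. -->
<configuration>
  <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
  </appender>
  <root level="INFO">
    <appender-ref ref="JSON"/>
  </root>
</configuration>
```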
Correlation ID propagation
A correlation ID (also called trace ID when using distributed tracing) is a unique identifier generated at the edge of your system — at the API gateway or at the first service that handles an external request. It is propagated via HTTP header through every service call downstream:
// Incoming request: extract or generate correlation ID
import java.io.IOException;
import java.util.Optional;
import java.util.UUID;

import org.slf4j.MDC;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

import jakarta.servlet.FilterChain;   // javax.servlet.* on Spring Boot 2.x
import jakarta.servlet.ServletException;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;

@Component
public class CorrelationIdFilter extends OncePerRequestFilter {

    public static final String CORRELATION_ID_HEADER = "X-Correlation-Id";

    @Override
    protected void doFilterInternal(HttpServletRequest request,
                                    HttpServletResponse response,
                                    FilterChain chain) throws IOException, ServletException {
        // Reuse the caller's ID if present; otherwise this service is the edge
        String correlationId = Optional
            .ofNullable(request.getHeader(CORRELATION_ID_HEADER))
            .orElse(UUID.randomUUID().toString());
        // Store in MDC so all logs on this thread include it automatically
        MDC.put("trace_id", correlationId);
        // Echo it back so callers can report the ID when filing issues
        response.setHeader(CORRELATION_ID_HEADER, correlationId);
        try {
            chain.doFilter(request, response);
        } finally {
            MDC.remove("trace_id"); // don't leak into the next request on this thread
        }
    }
}
// Outgoing service call: forward correlation ID downstream
// (a Feign RequestInterceptor; register it as a Spring bean so clients pick it up)
import feign.RequestInterceptor;
import feign.RequestTemplate;
import org.slf4j.MDC;

public class CorrelationIdInterceptor implements RequestInterceptor {
    @Override
    public void apply(RequestTemplate template) {
        String correlationId = MDC.get("trace_id");
        if (correlationId != null) {
            template.header("X-Correlation-Id", correlationId);
        }
    }
}
With this in place, every log line from every service that handles a given request includes the same trace_id. Finding all logs for a specific request becomes a single query:
# Elasticsearch/OpenSearch query
{ "query": { "term": { "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736" } } }
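One caveat: MDC is thread-local, so the trace_id set by the filter does not follow work handed to another thread or executor. The copy-on-submit pattern fixes this. A minimal sketch, using a plain ThreadLocal as a stand-in for MDC (with SLF4J you would capture MDC.getCopyOfContextMap() on submit and restore it with MDC.setContextMap() inside the task, the same way):

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch: correlation IDs live in thread-local storage (SLF4J's MDC is a
// thread-local map under the hood), so they do NOT cross thread boundaries
// automatically. A plain ThreadLocal stands in for MDC here.
public class TraceContext {
    static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    // Wrap a task so it carries the submitting thread's trace_id.
    static Runnable withTraceId(Runnable task) {
        String captured = TRACE_ID.get();   // capture on the caller's thread
        return () -> {
            TRACE_ID.set(captured);         // restore on the worker thread
            try {
                task.run();
            } finally {
                TRACE_ID.remove();          // avoid leaking into pooled threads
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        TRACE_ID.set("4bf92f3577b34da6a3ce929d0e0e4736");
        AtomicReference<String> seenByWorker = new AtomicReference<>();
        Thread worker = new Thread(withTraceId(() -> seenByWorker.set(TRACE_ID.get())));
        worker.start();
        worker.join();
        System.out.println("worker saw trace_id: " + seenByWorker.get());
    }
}
```

Without the wrapper, the worker thread would see a null trace_id and its log lines would drop out of the correlated query above.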
What to actually log
More logs are not always better. Log files that contain every method entry and exit, every variable value, and every SQL query are expensive to store, slow to query, and obscure the signals you actually need.
Log at the right level for the right information:
INFO: request received (with method, path, correlation ID), request completed (with status code, duration), significant business events (order placed, payment processed, user registered).
WARN: degraded behavior that is handled (fallback used, retry succeeded, rate limit approaching), configuration that might be wrong, deprecated API versions being called.
ERROR: operation failed in a way that requires attention, unhandled exceptions, dependency failures that triggered circuit breakers.
DEBUG: not in production by default. Enable per-service via dynamic log level adjustment (Spring Boot Actuator's /loggers endpoint) when actively debugging a specific issue.
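The level guidance above, sketched in code. This uses the JDK's built-in java.util.logging so the snippet is dependency-free; with SLF4J/Logback the levels map directly (WARNING to WARN, SEVERE to ERROR). The inventory-fallback scenario is illustrative, not from the original services:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Level guidance in practice: INFO for lifecycle, WARN for handled
// degradation, ERROR for failures that need attention.
public class ReservationLogging {
    static final Logger log = Logger.getLogger("order-service");

    static void reserveInventory(String sku, boolean primaryAvailable) {
        // INFO: request lifecycle and significant business events
        log.info("inventory reservation started for " + sku);
        if (!primaryAvailable) {
            // WARN: degraded behavior that was handled (a fallback was used)
            log.warning("inventory primary unavailable, serving from cache for " + sku);
        }
        try {
            // ... reservation call would go here ...
        } catch (RuntimeException e) {
            // ERROR: the operation failed and requires attention
            log.log(Level.SEVERE, "inventory reservation failed for " + sku, e);
            throw e;
        }
    }

    public static void main(String[] args) {
        reserveInventory("sku-774", false);
    }
}
```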
Log aggregation and the ELK/Grafana stack
Individual service logs are useless if not centrally aggregated and queryable. Fluent Bit (a lightweight log forwarder, deployed as a DaemonSet on Kubernetes) collects logs from all pods and forwards them to Elasticsearch or OpenSearch. Kibana (or OpenSearch Dashboards) provides the query interface.
The alternative: Grafana Loki (designed for Kubernetes, stores logs with label-based indexing rather than full-text indexing). Loki is cheaper to operate than Elasticsearch for pure log storage, and integrates naturally with Grafana alongside metrics and traces.
Whichever stack you choose: establish a log retention policy before you need it. 30 days of raw logs from a moderately trafficked system can be several terabytes. Hot storage (fast query) for 7 days, warm storage (slower query) for 30 days, and archive for compliance requirements is a common tiering.
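On Elasticsearch, that tiering maps onto index lifecycle management (ILM). A minimal sketch of a policy, assuming daily index rollover; the phase and action names follow the ILM API, but verify thresholds against your version, and note that an archive tier would typically be handled by snapshots before the delete phase:

```json
{
  "policy": {
    "phases": {
      "hot": {
        "actions": { "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" } }
      },
      "warm": {
        "min_age": "7d",
        "actions": { "shrink": { "number_of_shards": 1 } }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```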
The correlation ID strategy is infrastructure work. Do it once, enforce it in your service template (the base configuration every new service starts from), and every new service gets it automatically. The alternative is retrofitting it into every service after you've had the debugging incident that makes you realize you needed it.