Production-Ready Spring Boot — The Observability Setup That Catches Problems Before Users Do
by Eric Hanson, Backend Developer at Clean Systems Consulting
The gap between running and observable
An application that starts and responds to requests is running. An application where you can answer "is it healthy?", "what is it doing right now?", "what happened five minutes ago when that error spiked?", and "which service caused that slow request?" — that's observable.
Most Spring Boot setups have the first. Getting to the second requires deliberate configuration of four things: health indicators, structured logs, metrics, and distributed traces. Spring Boot's ecosystem covers all four; the defaults cover only some of them.
Health checks — what to expose and what to hide
Spring Boot Actuator provides /actuator/health out of the box. The default configuration exposes an aggregate status — UP, DOWN, OUT_OF_SERVICE, UNKNOWN — and hides the detail.
Configure what to expose at each endpoint:
management:
  endpoints:
    web:
      exposure:
        include: health, info, metrics, prometheus
  endpoint:
    health:
      show-details: when-authorized   # or 'always' for internal services
      show-components: when-authorized
  health:
    db:
      enabled: true
    redis:
      enabled: true
    diskspace:
      enabled: true
      threshold: 524288000   # 500MB minimum free space
Never expose all actuator endpoints publicly. /actuator/env exposes environment variables including secrets. /actuator/loggers allows changing log levels at runtime — useful in production but only for authorized users. /actuator/heapdump triggers a heap dump — a denial-of-service vector if exposed publicly. Expose health, info, metrics, and prometheus publicly; gate everything else behind authentication.
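As a sketch, that gating could look like the following Spring Security configuration. This assumes spring-boot-starter-security is on the classpath; the role name ACTUATOR_ADMIN is illustrative, not a Spring convention:

```java
import org.springframework.boot.actuate.autoconfigure.security.servlet.EndpointRequest;
import org.springframework.boot.actuate.health.HealthEndpoint;
import org.springframework.boot.actuate.info.InfoEndpoint;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.Customizer;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
public class ActuatorSecurityConfig {

    @Bean
    SecurityFilterChain actuatorFilterChain(HttpSecurity http) throws Exception {
        http
            // This chain applies only to actuator endpoints
            .securityMatcher(EndpointRequest.toAnyEndpoint())
            .authorizeHttpRequests(auth -> auth
                // Health and info are safe for probes and load balancers
                .requestMatchers(EndpointRequest.to(HealthEndpoint.class, InfoEndpoint.class)).permitAll()
                // Metrics scraping; restrict by network policy where possible
                .requestMatchers(EndpointRequest.to("prometheus", "metrics")).permitAll()
                // Everything else (env, loggers, heapdump, ...) requires authentication
                .anyRequest().hasRole("ACTUATOR_ADMIN"))
            .httpBasic(Customizer.withDefaults());
        return http.build();
    }
}
```

The EndpointRequest matchers track the actuator base path automatically, so the rules keep working if management.endpoints.web.base-path changes.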
Custom health indicators for critical dependencies your application needs to function:
@Component
public class PaymentGatewayHealthIndicator implements HealthIndicator {

    private final PaymentGatewayClient client;

    public PaymentGatewayHealthIndicator(PaymentGatewayClient client) {
        this.client = client;
    }

    @Override
    public Health health() {
        try {
            boolean reachable = client.ping();
            if (reachable) {
                return Health.up()
                        .withDetail("gateway", "stripe")
                        .withDetail("latency_ms", client.lastPingLatencyMs())
                        .build();
            }
            return Health.down()
                    .withDetail("gateway", "stripe")
                    .withDetail("reason", "ping failed")
                    .build();
        } catch (Exception e) {
            return Health.down(e).build();
        }
    }
}
The health endpoint is the contract your load balancer and orchestration platform uses. Kubernetes liveness and readiness probes should target separate endpoints:
management:
  endpoint:
    health:
      probes:
        enabled: true
# Exposes /actuator/health/liveness and /actuator/health/readiness
Liveness: is the application alive? If not, restart it. Should only fail for unrecoverable states — deadlock, corrupted state — not for external dependency failures.
Readiness: is the application ready to accept traffic? Should fail if critical dependencies (database, message queue) are unavailable. External dependency health indicators belong in the readiness group:
Spring Boot assigns indicators to probes through health groups in configuration (there is no @Readiness annotation). An indicator is referenced by its contributor name, which is the bean name with the HealthIndicator suffix stripped: the auto-configured DataSource check is db, and the PaymentGatewayHealthIndicator above is paymentGateway:
management:
  endpoint:
    health:
      group:
        readiness:
          include: readinessState, db
        liveness:
          include: livenessState
readinessState and livenessState are Spring Boot's built-in application-state indicators (Spring Boot 2.3+). Adding db to the readiness group means a database outage fails readiness, so traffic is routed away, without failing liveness and putting the pod in a restart loop while the database recovers.
Structured logging
Unstructured log lines — 2024-01-15 ERROR OrderService: Failed to process order 123 — require regex parsing to extract fields for alerting and querying. Structured logs emit JSON or key-value pairs that log aggregators can index directly.
Configure Logback for JSON output with Logstash encoder:
<!-- logback-spring.xml -->
<configuration>
    <springProfile name="production">
        <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
            <encoder class="net.logstash.logback.encoder.LogstashEncoder">
                <includeMdcKeyName>traceId</includeMdcKeyName>
                <includeMdcKeyName>spanId</includeMdcKeyName>
                <includeMdcKeyName>userId</includeMdcKeyName>
                <includeMdcKeyName>requestId</includeMdcKeyName>
            </encoder>
        </appender>
        <root level="INFO">
            <appender-ref ref="JSON" />
        </root>
    </springProfile>
</configuration>
MDC (Mapped Diagnostic Context) fields are injected per-request and appear in every log line for that request. Populate MDC at request entry:
@Component
public class LoggingFilter extends OncePerRequestFilter {

    @Override
    protected void doFilterInternal(HttpServletRequest request,
            HttpServletResponse response, FilterChain chain)
            throws ServletException, IOException {
        try {
            MDC.put("requestId", UUID.randomUUID().toString());
            MDC.put("method", request.getMethod());
            MDC.put("path", request.getRequestURI());
            chain.doFilter(request, response);
        } finally {
            MDC.clear(); // always clear: pooled threads are reused across requests
        }
    }
}
MDC.clear() in finally is critical. In thread pool environments, threads are reused — MDC from a previous request will leak into the next request on the same thread if not cleared.
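The leak is easy to reproduce with a plain ThreadLocal, which is essentially what MDC uses for per-thread storage. This is a self-contained sketch, not SLF4J itself:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MdcLeakDemo {

    // MDC is backed by a per-thread map, modeled here with a ThreadLocal
    static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(HashMap::new);

    public static void main(String[] args) throws Exception {
        // Single-thread pool: every task is guaranteed to reuse the same thread
        ExecutorService pool = Executors.newFixedThreadPool(1);

        // "Request 1" sets a requestId but forgets to clear it
        pool.submit(() -> CONTEXT.get().put("requestId", "req-1")).get();

        // "Request 2" on the same pooled thread sees the stale value
        String leaked = pool.submit(() -> CONTEXT.get().get("requestId")).get();
        System.out.println("leaked requestId: " + leaked);

        // After a finally-style clear, the next task starts clean
        pool.submit(() -> CONTEXT.get().clear()).get();
        String clean = pool.submit(() -> CONTEXT.get().get("requestId")).get();
        System.out.println("after clear: " + clean);

        pool.shutdown();
    }
}
```

Running this prints "leaked requestId: req-1" then "after clear: null", which is exactly the cross-request contamination the finally block prevents.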
Log level discipline. Production logs at INFO. DEBUG logs are noise in production and CPU overhead in high-throughput services: with string concatenation, the message is built before the level check, whereas SLF4J's {} placeholders defer formatting until the level is confirmed. Set specific packages to DEBUG via Actuator when diagnosing a live issue — don't leave them there:
curl -X POST localhost:8080/actuator/loggers/com.example.payments \
-H "Content-Type: application/json" \
-d '{"configuredLevel": "DEBUG"}'
This changes the log level at runtime without restart. Revert after diagnosis.
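The cost of eager message construction is easy to see in plain Java, without SLF4J on the classpath. This sketch contrasts a concatenated call with a deferred one (a Supplier stands in for SLF4J's {} placeholders or its fluent log.atDebug() API):

```java
import java.util.function.Supplier;

public class LazyLoggingDemo {

    static boolean debugEnabled = false; // production: DEBUG is off
    static int expensiveCalls = 0;

    // Stand-in for an expensive toString()/serialization of a large object
    static String expensiveDump() {
        expensiveCalls++;
        return "big-object-dump";
    }

    // Eager: the argument is fully built before the level check,
    // like log.debug("state: " + expensiveDump())
    static void debugEager(String message) {
        if (debugEnabled) System.out.println(message);
    }

    // Lazy: the supplier only runs if the level is enabled,
    // like log.debug("state: {}", obj) deferring obj.toString()
    static void debugLazy(Supplier<String> message) {
        if (debugEnabled) System.out.println(message.get());
    }

    public static void main(String[] args) {
        debugEager("state: " + expensiveDump()); // expensiveDump() runs anyway
        System.out.println("eager calls: " + expensiveCalls);

        debugLazy(() -> "state: " + expensiveDump()); // skipped entirely
        System.out.println("after lazy: " + expensiveCalls);
    }
}
```

The eager call pays for the dump even though nothing is logged ("eager calls: 1"); the lazy call costs nothing while DEBUG is off ("after lazy: 1").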
Metrics with Micrometer
Micrometer is Spring Boot's metrics facade — it exposes metrics in any format (Prometheus, Datadog, CloudWatch, InfluxDB) via a pluggable registry. Spring Boot auto-configures JVM, HTTP, and database metrics:
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
management:
  metrics:
    tags:
      application: ${spring.application.name}
      environment: ${spring.profiles.active}
    distribution:
      percentiles-histogram:
        http.server.requests: true   # enables histogram for percentile calculation
      percentiles:
        http.server.requests: 0.5, 0.95, 0.99
Adding application and environment tags to all metrics makes filtering in dashboards trivial.
The metrics that warrant alerts:
# HTTP
http.server.requests{status="5xx"} — error rate, alert on > 1% of requests
http.server.requests{outcome="SUCCESS", quantile="0.99"} — p99 latency, alert on > SLA
# JVM
jvm.memory.used{area="heap"} / jvm.memory.max{area="heap"} — heap usage ratio, alert on > 80%
jvm.gc.pause (max) — worst GC pause, alert on > 200ms
jvm.threads.states{state="blocked"} — blocked threads, alert on sustained count > 0
# Database (HikariCP)
hikaricp.connections.pending — connection pool wait, alert on > 0 sustained
hikaricp.connections.timeout — pool timeout events, alert on any
hikaricp.connections.active / hikaricp.connections.max — pool utilization
# Application-specific
business.orders.processed.total — throughput, alert on sudden drop
business.payment.failures.total — payment failures, alert on rate increase
Custom metrics:
@Service
public class OrderService {

    private final Counter orderCounter;
    private final Timer processingTimer;

    public OrderService(MeterRegistry registry) {
        this.orderCounter = Counter.builder("business.orders.processed")
                .description("Total orders processed")
                .tag("environment", "production")
                .register(registry);
        this.processingTimer = Timer.builder("business.order.processing.duration")
                .description("Order processing duration")
                .publishPercentileHistogram()
                .register(registry);
    }

    public void processOrder(Order order) {
        processingTimer.record(() -> {
            doProcess(order);
            orderCounter.increment();
        });
    }
}
Timers with publishPercentileHistogram() enable server-side percentile calculation in Prometheus/Grafana. Without the histogram, only mean and max are computable.
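If constructor-injected meters feel heavyweight for simple cases, Micrometer also offers an annotation-driven path via @Timed. A sketch, assuming micrometer-core and Spring AOP are on the classpath:

```java
import io.micrometer.core.aop.TimedAspect;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MetricsAspectConfig {

    // TimedAspect intercepts @Timed methods on Spring beans
    // and records a Timer per annotated method
    @Bean
    public TimedAspect timedAspect(MeterRegistry registry) {
        return new TimedAspect(registry);
    }
}

// Usage on any Spring-managed bean method:
// @Timed(value = "business.order.processing.duration", histogram = true)
// public void processOrder(Order order) { ... }
```

The histogram = true attribute enables the same percentile histogram as publishPercentileHistogram() in the builder API. Note that, like any Spring AOP advice, @Timed only fires on calls that go through the proxy, not on self-invocation.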
Distributed tracing with Micrometer Tracing
Micrometer Tracing (Spring Boot 3.x, formerly Spring Cloud Sleuth) automatically instruments Spring MVC, Spring WebFlux, Spring Data, and messaging. Add the dependency for your tracing backend:
<!-- For Zipkin -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-brave</artifactId>
</dependency>
<dependency>
    <groupId>io.zipkin.reporter2</groupId>
    <artifactId>zipkin-reporter-brave</artifactId>
</dependency>

<!-- For OpenTelemetry (preferred for modern setups) -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>
management:
  tracing:
    sampling:
      probability: 0.1   # sample 10% of requests — adjust based on volume
With tracing configured, every HTTP request gets a traceId that propagates through all downstream service calls. The traceId appears in logs (via MDC injection), in metrics tags, and in the trace viewer (Zipkin, Jaeger, Grafana Tempo). A single traceId from a user report lets you reconstruct the entire request path across services.
Sampling rate. Use 100% sampling for low-volume services and 1–10% for high-volume ones. Store the traceId in error responses so users can report it. Note that head-based sampling decides at the start of a trace, before the application knows whether the request will fail, so "always sample errors" cannot be implemented by an in-process sampler alone. To keep every error trace, either sample at 100% and filter downstream, or use tail-based sampling in a collector (the OpenTelemetry Collector's tail_sampling processor buffers spans and decides after the trace completes). What an in-process sampler can do is cap volume; with the Brave bridge, defining a Sampler bean overrides the probability property:
@Bean
public Sampler sampler() {
    // Cap tracing at a fixed number of traces per second,
    // regardless of traffic volume (brave.sampler.RateLimitingSampler)
    return RateLimitingSampler.create(100);
}
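Returning the traceId to the caller can be done with a small exception handler. This sketch uses Micrometer Tracing's Tracer API; the response shape is illustrative, not a Spring convention:

```java
import io.micrometer.tracing.Span;
import io.micrometer.tracing.Tracer;
import java.util.Map;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.RestControllerAdvice;

@RestControllerAdvice
public class TraceIdErrorAdvice {

    private final Tracer tracer;

    public TraceIdErrorAdvice(Tracer tracer) {
        this.tracer = tracer;
    }

    @ExceptionHandler(Exception.class)
    public ResponseEntity<Map<String, String>> handleUnexpected(Exception e) {
        // Attach the current traceId so users can quote it in bug reports
        Span span = tracer.currentSpan();
        String traceId = (span != null) ? span.context().traceId() : "unavailable";
        return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR)
                .body(Map.of("error", "Internal server error", "traceId", traceId));
    }
}
```

A user-reported traceId from this payload is the entry point into the trace viewer, even when the user can't tell you anything else about the failure.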
The startup verification checklist
Before declaring a service production-ready:
- /actuator/health returns UP and shows component detail for authorized requests
- /actuator/health/liveness and /actuator/health/readiness exist and return correct status
- Custom health indicators cover all critical external dependencies
- Logs are structured JSON in production profile, unstructured in development
- MDC includes requestId, traceId, and spanId in every log line
- /actuator/prometheus exposes metrics and is scraped by the metrics system
- HTTP error rate and p99 latency alerts are configured
- HikariCP connection pool metrics are being collected
- Distributed trace IDs appear in error log lines and error responses
- Log level can be changed at runtime via Actuator without restart
Each item on this list represents a question you'll need to answer during an incident. Missing any of them means that question goes unanswered — at the worst possible time.