Production-Ready Spring Boot — The Observability Setup That Catches Problems Before Users Do

by Eric Hanson, Backend Developer at Clean Systems Consulting

The gap between running and observable

An application that starts and responds to requests is running. An application where you can answer "is it healthy?", "what is it doing right now?", "what happened five minutes ago when that error spiked?", and "which service caused that slow request?" — that's observable.

Most Spring Boot setups have the first. Getting to the second requires deliberate configuration of four things: health indicators, structured logs, metrics, and distributed traces. Spring Boot's ecosystem covers all four; the defaults cover only some of them.

Health checks — what to expose and what to hide

Spring Boot Actuator provides /actuator/health out of the box. The default configuration exposes an aggregate status — UP, DOWN, OUT_OF_SERVICE, UNKNOWN — and hides the detail.

Configure what to expose at each endpoint:

management:
  endpoints:
    web:
      exposure:
        include: health, info, metrics, prometheus
  endpoint:
    health:
      show-details: when-authorized  # or 'always' for internal services
      show-components: when-authorized
  health:
    db:
      enabled: true
    redis:
      enabled: true
    diskspace:
      enabled: true
      threshold: 524288000  # 500MB minimum free space

Never expose all actuator endpoints publicly. /actuator/env exposes environment variables including secrets. /actuator/loggers allows changing log levels at runtime — useful in production but only for authorized users. /actuator/heapdump triggers a heap dump — a denial-of-service vector if exposed publicly. Expose health, info, metrics, and prometheus publicly; gate everything else behind authentication.
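One way to wire that split is a dedicated Spring Security filter chain for actuator endpoints. This is a sketch with Spring Security 6; it assumes spring-boot-starter-security is on the classpath, and the role name and basic auth are placeholders for your real auth setup:

```java
import org.springframework.boot.actuate.autoconfigure.security.servlet.EndpointRequest;
import org.springframework.boot.actuate.health.HealthEndpoint;
import org.springframework.boot.actuate.info.InfoEndpoint;
import org.springframework.boot.actuate.metrics.MetricsEndpoint;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.Customizer;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
public class ActuatorSecurityConfig {

    @Bean
    public SecurityFilterChain actuatorChain(HttpSecurity http) throws Exception {
        http
            // Only match actuator endpoints; the application's own chain handles the rest.
            .securityMatcher(EndpointRequest.toAnyEndpoint())
            .authorizeHttpRequests(auth -> auth
                // Public: health, info, metrics, prometheus
                .requestMatchers(EndpointRequest.to(
                        HealthEndpoint.class, InfoEndpoint.class, MetricsEndpoint.class))
                    .permitAll()
                .requestMatchers(EndpointRequest.to("prometheus")).permitAll()
                // Everything else (env, loggers, heapdump, ...) requires auth
                .anyRequest().hasRole("ACTUATOR"))
            .httpBasic(Customizer.withDefaults());
        return http.build();
    }
}
```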

Custom health indicators for critical dependencies your application needs to function:

@Component
public class PaymentGatewayHealthIndicator implements HealthIndicator {
    private final PaymentGatewayClient client;

    public PaymentGatewayHealthIndicator(PaymentGatewayClient client) {
        this.client = client;
    }

    @Override
    public Health health() {
        try {
            boolean reachable = client.ping();
            if (reachable) {
                return Health.up()
                    .withDetail("gateway", "stripe")
                    .withDetail("latency_ms", client.lastPingLatencyMs())
                    .build();
            }
            return Health.down()
                .withDetail("gateway", "stripe")
                .withDetail("reason", "ping failed")
                .build();
        } catch (Exception e) {
            return Health.down(e).build();
        }
    }
}

The health endpoint is the contract your load balancer and orchestration platform use. Kubernetes liveness and readiness probes should target separate endpoints:

management:
  endpoint:
    health:
      probes:
        enabled: true
# Exposes /actuator/health/liveness and /actuator/health/readiness
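On the Kubernetes side, a Deployment fragment wiring those endpoints might look like this (port and timings are illustrative):

```yaml
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
```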

Liveness: is the application alive? If not, restart it. Should only fail for unrecoverable states — deadlock, corrupted state — not for external dependency failures.

Readiness: is the application ready to accept traffic? Should fail if critical dependencies (database, message queue) are unavailable. External dependency health indicators belong in the readiness group:

@Component("database")
public class DatabaseHealthIndicator implements HealthIndicator {
    // ... check database connectivity
}

Assign it to the readiness group in application properties:

management:
  endpoint:
    health:
      group:
        liveness:
          include: livenessState
        readiness:
          include: readinessState, database

In Spring Boot 2.3+, indicators are assigned to the liveness and readiness probe groups this way: each group's include list references indicators by bean-derived name (here database), and the built-in livenessState and readinessState indicators track the application's own lifecycle.

Structured logging

Unstructured log lines — 2024-01-15 ERROR OrderService: Failed to process order 123 — require regex parsing to extract fields for alerting and querying. Structured logs emit JSON or key-value pairs that log aggregators can index directly.

Configure Logback for JSON output with Logstash encoder:

<!-- logback-spring.xml -->
<configuration>
  <springProfile name="production">
    <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
      <encoder class="net.logstash.logback.encoder.LogstashEncoder">
        <includeMdcKeyName>traceId</includeMdcKeyName>
        <includeMdcKeyName>spanId</includeMdcKeyName>
        <includeMdcKeyName>userId</includeMdcKeyName>
        <includeMdcKeyName>requestId</includeMdcKeyName>
      </encoder>
    </appender>
    <root level="INFO">
      <appender-ref ref="JSON" />
    </root>
  </springProfile>
</configuration>

MDC (Mapped Diagnostic Context) fields are injected per-request and appear in every log line for that request. Populate MDC at request entry:

@Component
public class LoggingFilter extends OncePerRequestFilter {
    @Override
    protected void doFilterInternal(HttpServletRequest request,
            HttpServletResponse response, FilterChain chain)
            throws ServletException, IOException {
        try {
            MDC.put("requestId", UUID.randomUUID().toString());
            MDC.put("method", request.getMethod());
            MDC.put("path", request.getRequestURI());
            chain.doFilter(request, response);
        } finally {
            MDC.clear(); // always clear — pooled threads are reused across requests
        }
    }
}

MDC.clear() in finally is critical. In thread pool environments, threads are reused — MDC from a previous request will leak into the next request on the same thread if not cleared.
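The hazard is easy to reproduce. MDC is ThreadLocal-backed, so a stripped-down sketch with a plain ThreadLocal and a single-thread pool (class and variable names are mine) shows the leak:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadReuseDemo {
    // Stand-in for MDC, which is also backed by a ThreadLocal.
    static final ThreadLocal<String> context = new ThreadLocal<>();

    public static void main(String[] args) throws Exception {
        // A single-thread pool guarantees the second task reuses the first task's thread.
        ExecutorService pool = Executors.newFixedThreadPool(1);

        // "Request 1" sets a context value and forgets to clear it.
        pool.submit(() -> context.set("request-1")).get();

        // "Request 2" runs on the same pooled thread and sees the stale value.
        String leaked = pool.submit(() -> context.get()).get();
        System.out.println("leaked context: " + leaked); // prints "leaked context: request-1"

        pool.shutdown();
    }
}
```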

Log level discipline. Production logs at INFO. DEBUG logs are noise in production and CPU overhead in high-throughput services: with plain string concatenation the message is built before the level check, whereas SLF4J's {} placeholders defer formatting until the level passes. Set specific packages to DEBUG via Actuator when diagnosing a live issue — don't leave them there:

curl -X POST localhost:8080/actuator/loggers/com.example.payments \
  -H "Content-Type: application/json" \
  -d '{"configuredLevel": "DEBUG"}'

This changes the log level at runtime without restart. Revert after diagnosis.
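Per the loggers endpoint contract, posting a null configuredLevel resets the logger to its inherited level:

```shell
curl -X POST localhost:8080/actuator/loggers/com.example.payments \
  -H "Content-Type: application/json" \
  -d '{"configuredLevel": null}'
```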

Metrics with Micrometer

Micrometer is Spring Boot's metrics facade — it exposes metrics in any format (Prometheus, Datadog, CloudWatch, InfluxDB) via a pluggable registry. Spring Boot auto-configures JVM, HTTP, and database metrics:

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

Then add common tags and percentile histograms in application.yml:

management:
  metrics:
    tags:
      application: ${spring.application.name}
      environment: ${spring.profiles.active}
    distribution:
      percentiles-histogram:
        http.server.requests: true  # enables histogram for percentile calculation
      percentiles:
        http.server.requests: 0.5, 0.95, 0.99

Adding application and environment tags to all metrics makes filtering in dashboards trivial.

The metrics that warrant alerts:

# HTTP
http.server.requests{outcome="SERVER_ERROR"} — error rate, alert on > 1% of requests
http.server.requests p99 latency (from the percentile histogram) — alert on > SLA

# JVM
jvm.memory.used{area="heap"} / jvm.memory.max{area="heap"} — heap usage ratio, alert on > 80%
jvm.gc.pause (max) — worst GC pause, alert on > 200ms
jvm.threads.states{state="blocked"} — blocked threads, alert on sustained count > 0

# Database (HikariCP)
hikaricp.connections.pending — threads waiting for a connection, alert on > 0 sustained
hikaricp.connections.timeout — pool checkout timeouts, alert on any
hikaricp.connections.active / hikaricp.connections.max — pool utilization

# Application-specific
business.orders.processed.total — throughput, alert on sudden drop
business.payment.failures.total — payment failures, alert on rate increase
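As a concrete example, the 5xx error-rate alert above can be expressed as a Prometheus alerting rule. A sketch: metric names follow Micrometer's default Prometheus rendering (http.server.requests becomes http_server_requests_seconds_count), and thresholds and labels are placeholders:

```yaml
groups:
  - name: http-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_requests_seconds_count{outcome="SERVER_ERROR"}[5m]))
            / sum(rate(http_server_requests_seconds_count[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "5xx rate above 1% for 5 minutes"
```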

Custom metrics:

@Service
public class OrderService {
    private final Counter orderCounter;
    private final Timer processingTimer;

    public OrderService(MeterRegistry registry) {
        this.orderCounter = Counter.builder("business.orders.processed")
            .description("Total orders processed")
            .register(registry);

        this.processingTimer = Timer.builder("business.order.processing.duration")
            .description("Order processing duration")
            .publishPercentileHistogram()
            .register(registry);
    }

    public void processOrder(Order order) {
        processingTimer.record(() -> {
            doProcess(order);
            orderCounter.increment();
        });
    }
}

Timers with publishPercentileHistogram() enable server-side percentile calculation in Prometheus/Grafana. Without the histogram, only mean and max are computable.
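With the histogram in place, a server-side p99 can be computed in PromQL (assuming Micrometer's default Prometheus naming, where http.server.requests becomes http_server_requests_seconds):

```promql
histogram_quantile(
  0.99,
  sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri)
)
```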

Distributed tracing with Micrometer Tracing

Micrometer Tracing (Spring Boot 3.x, formerly Spring Cloud Sleuth) automatically instruments Spring MVC, Spring WebFlux, Spring Data, and messaging. Add the dependency for your tracing backend:

<!-- For Zipkin -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-brave</artifactId>
</dependency>
<dependency>
    <groupId>io.zipkin.reporter2</groupId>
    <artifactId>zipkin-reporter-brave</artifactId>
</dependency>

<!-- For OpenTelemetry (preferred for modern setups) -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>

Then set the sampling probability:

management:
  tracing:
    sampling:
      probability: 0.1  # sample 10% of requests — adjust based on volume

With tracing configured, every HTTP request gets a traceId that propagates through all downstream service calls. The traceId appears in logs (via MDC injection), in metrics tags, and in the trace viewer (Zipkin, Jaeger, Grafana Tempo). A single traceId from a user report lets you reconstruct the entire request path across services.

Sampling rate. 100% sampling for low-volume services; 1–10% for high-volume. Store the traceId in error responses so users can report it. One caveat: a head-based sampler decides when the request starts, before its outcome is known, so "sample every error" cannot be enforced in-process; that requires tail-based sampling in the collector (the OpenTelemetry Collector's tail sampling processor, for example). What a custom in-process sampler can do is cap trace volume. A sketch using the Brave bridge (RateLimitingSampler is Brave's API; verify that your Boot version picks up a Sampler bean):

@Bean
public Sampler customSampler() {
    // At most 100 traces per second regardless of traffic:
    // predictable overhead during spikes, unlike a fixed probability.
    return RateLimitingSampler.create(100);
}

The startup verification checklist

Before declaring a service production-ready:

  • /actuator/health returns UP and shows component detail for authorized requests
  • /actuator/health/liveness and /actuator/health/readiness exist and return correct status
  • Custom health indicators cover all critical external dependencies
  • Logs are structured JSON in production profile, unstructured in development
  • MDC includes requestId, traceId, and spanId in every log line
  • /actuator/prometheus exposes metrics and is scraped by the metrics system
  • HTTP error rate and p99 latency alerts are configured
  • HikariCP connection pool metrics are being collected
  • Distributed trace IDs appear in error log lines and error responses
  • Log level can be changed at runtime via Actuator without restart

Each item on this list represents a question you'll need to answer during an incident. Missing any of them means that question goes unanswered — at the worst possible time.
