Your Pipeline Is Flaky and That Is a Bigger Problem Than You Think
by Eric Hanson, Backend Developer at Clean Systems Consulting
The Red Build Nobody Investigates
Your pipeline fails. A developer looks at the job name and the stage, pulls up the logs, sees "connection refused" on a Testcontainer startup, and clicks retry. Forty seconds later, it's green. They move on. This happened three times yesterday. Nobody filed a ticket.
This is the slow death of your CI system. Not through a catastrophic failure — through accumulated tolerance for failures that "don't count." The team has learned that some red builds are real (code problem) and some are noise (environment problem), and they've learned to tell them apart by feel rather than by trusting the pipeline. The moment that distinction becomes learned behavior, your pipeline has stopped being a reliable safety net.
Why Flakiness Is a Trust Problem, Not a Time Problem
The obvious cost of a flaky test is the time spent on retries. A pipeline that runs 30 times a day with a 5% flake rate wastes 1.5 pipeline runs per day on retries — annoying but not catastrophic.
The hidden cost is that developers learn to discount red builds. Once the team accepts "that's probably just flakiness" as a valid response to a failure, every genuine failure has to compete with that assumption. How many times will a developer retry a real regression before concluding it's a flake? Once? Twice? The answer depends on how often retrying worked before — which is exactly what a high flake rate trains them to expect.
In high-flake environments, genuine regressions get merged. Not because developers are careless, but because the pipeline has taught them that red doesn't mean broken.
The Common Sources and Their Fixes
Time-dependent tests are the most common and most fixable. Any test that calls new Date(), System.currentTimeMillis(), or Instant.now() directly is potentially flaky if it asserts on timing or relies on a specific temporal state.
// Flaky: behavior changes based on when the test runs
// (assume a token is valid for five seconds after it is issued)
@Test
void shouldRejectExpiredToken() {
    Token token = new Token(Instant.now().minusSeconds(5)); // issued exactly at the expiry boundary
    assertTrue(token.isExpired()); // whether this passes depends on how many milliseconds elapse before it runs
}

// Stable: inject a controllable clock so "now" never moves during the test
@Test
void shouldRejectExpiredToken() {
    Clock fixed = Clock.fixed(Instant.parse("2026-04-25T10:00:00Z"), ZoneOffset.UTC);
    Token token = new Token(Instant.parse("2026-04-25T09:59:54Z"), fixed); // issued six seconds before the fixed "now"
    assertTrue(token.isExpired()); // deterministic: isExpired() reads the injected clock, which never moves
}
External service dependencies are the second-largest source. Tests that hit real HTTP endpoints, real databases, or real message brokers are exposed to network variability, service outages, and rate limiting. Stub external HTTP services with WireMock, run databases in Testcontainers with an explicit readiness wait, and use in-memory implementations (like embedded Kafka) for message brokers in unit tests.
// Testcontainers with an explicit readiness guarantee (class annotated with @Testcontainers)
@Container
static PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:16")
        // Don't just wait for the mapped port; wait until Postgres logs that it is ready.
        // Wait.forHealthcheck() only works for images that define a HEALTHCHECK, which the
        // official postgres image does not, so wait on the readiness log line instead.
        .waitingFor(Wait.forLogMessage(".*database system is ready to accept connections.*\\s", 2)
                .withStartupTimeout(Duration.ofSeconds(60)));
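For the HTTP case, a minimal WireMock sketch looks something like the one below. The test class, endpoint path, and JSON payload are invented for illustration; the point is that the client under test gets pointed at wireMock.baseUrl() instead of the real host, so network variability, outages, and rate limits stop affecting the result.

import com.github.tomakehurst.wiremock.WireMockServer;
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;

import static com.github.tomakehurst.wiremock.client.WireMock.get;
import static com.github.tomakehurst.wiremock.client.WireMock.okJson;
import static com.github.tomakehurst.wiremock.client.WireMock.urlEqualTo;
import static com.github.tomakehurst.wiremock.core.WireMockConfiguration.wireMockConfig;
import static org.junit.jupiter.api.Assertions.assertTrue;

class ExchangeRateClientTest {

    // Random free port avoids collisions with other jobs on the same runner
    static WireMockServer wireMock = new WireMockServer(wireMockConfig().dynamicPort());

    @BeforeAll
    static void startStub() {
        wireMock.start();
        // The test now controls the "external" response: no network, no rate limits, no outages
        wireMock.stubFor(get(urlEqualTo("/rates/EUR")).willReturn(okJson("{\"rate\": 1.08}")));
    }

    @AfterAll
    static void stopStub() {
        wireMock.stop();
    }

    @Test
    void clientTalksToTheStubNotTheRealService() {
        // Point the real client at wireMock.baseUrl() here instead of the production host
        assertTrue(wireMock.baseUrl().startsWith("http://localhost:"));
    }
}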
Shared mutable state between tests causes interference that depends on execution order. This is particularly common in Spring Boot integration tests that share an application context with mutable singletons or caches. Use @DirtiesContext to force context reload when state modification is unavoidable, or redesign tests to be state-independent.
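When state modification really is unavoidable, the escape hatch looks roughly like this (a sketch assuming JUnit 5 and Spring Boot; the class and the cache it mutates are hypothetical):

import org.junit.jupiter.api.Test;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.annotation.DirtiesContext;
import org.springframework.test.annotation.DirtiesContext.ClassMode;

// This class mutates a singleton cache inside the shared application context, so tell
// Spring to discard the context when the class finishes rather than letting the
// mutation leak into whichever test class happens to run next.
@SpringBootTest
@DirtiesContext(classMode = ClassMode.AFTER_CLASS)
class FeatureFlagCacheTest {

    @Test
    void enablesFlagForBetaUsers() {
        // flips a flag in the shared in-memory cache; the dirtied context is rebuilt afterwards
    }
}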
Resource contention on CI runners — tests that bind to specific ports, write to specific file paths, or allocate more memory than available on the runner. Use random port allocation (port: 0 in Spring Boot tests), temp directories, and check runner memory specs against what your tests actually need.
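Both mitigations fit in one small sketch, assuming Spring Boot 3 and JUnit 5; the test class and its behavior are hypothetical:

import java.nio.file.Path;

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.boot.test.context.SpringBootTest.WebEnvironment;
import org.springframework.boot.test.web.server.LocalServerPort;

// RANDOM_PORT lets parallel jobs on the same runner start servers without port collisions;
// @TempDir gives each test its own scratch directory instead of a hard-coded path.
@SpringBootTest(webEnvironment = WebEnvironment.RANDOM_PORT)
class ReportExportTest {

    @LocalServerPort
    int port; // the port Spring actually bound, injected at runtime

    @Test
    void writesReportOnlyUnderItsOwnTempDir(@TempDir Path tempDir) {
        // call the app at "http://localhost:" + port and write output only under tempDir
    }
}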
Tracking Flakiness
You can't fix what you're not measuring. Set up flake tracking before you start fixing individual tests:
# Simplified flake detection: a flaky run is one that failed at least once, then passed on retry.
# Pull the last 30 days of runs from your CI provider's API; for each run, collect the status
# of every attempt in order, e.g. ["failed", "success"].
def flake_rate(runs):
    flaky = sum(1 for attempts in runs if attempts[-1] == "success" and "failed" in attempts[:-1])
    return flaky / len(runs)

# Per-test: if you export test results as JUnit XML, aggregate them across runs.
# Any test that shows both PASS and FAIL in the last 100 runs is flaky.
Most CI platforms (GitHub Actions, CircleCI, Buildkite) have built-in test insights that show flaky tests over time. If yours doesn't, export JUnit XML from your test runner and aggregate it externally.
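If you do aggregate it yourself, a small sketch of the per-test pass/fail aggregation might look like the following. It assumes the JUnit XML files from recent runs have been collected under a single directory passed as the first argument, and it counts anything without a failure or error element (including skipped tests) as a pass.

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Stream;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Walk a directory of collected JUnit XML reports and flag any test that has
// both passed and failed somewhere in the sample.
public class FlakyTestFinder {

    public static void main(String[] args) throws Exception {
        List<Path> reportFiles;
        try (Stream<Path> paths = Files.walk(Path.of(args[0]))) {
            reportFiles = paths.filter(p -> p.toString().endsWith(".xml")).toList();
        }

        Map<String, Set<Boolean>> outcomes = new HashMap<>();
        for (Path report : reportFiles) {
            NodeList cases = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(report.toFile()).getElementsByTagName("testcase");
            for (int i = 0; i < cases.getLength(); i++) {
                Element testCase = (Element) cases.item(i);
                String id = testCase.getAttribute("classname") + "#" + testCase.getAttribute("name");
                boolean failed = testCase.getElementsByTagName("failure").getLength() > 0
                        || testCase.getElementsByTagName("error").getLength() > 0;
                outcomes.computeIfAbsent(id, k -> new HashSet<>()).add(failed);
            }
        }

        // A test that recorded both outcomes in the sample is flaky
        outcomes.forEach((id, results) -> {
            if (results.size() > 1) {
                System.out.println("FLAKY: " + id);
            }
        });
    }
}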
Set a target — 1% flake rate across all pipeline runs — and treat exceeding it as a P2 incident. Not a someday cleanup task. An active incident with an owner and a resolution date.
The Policy That Accelerates Fixing
The most effective policy for eliminating flakiness is simple: any test that flakes twice in a week gets quarantined (moved to a non-blocking suite) within 24 hours, and gets fixed or deleted within two weeks. Quarantined means it still runs but doesn't block merging, so flakiness doesn't propagate into developer workflow while the fix is in progress.
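One lightweight way to implement the quarantine, if you run JUnit 5, is a tag that the merge-blocking suite excludes and a separate non-blocking job includes. The test below is made up; only the @Tag matters:

import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.Test;

class CheckoutFlowTest {

    // Quarantined: excluded from the blocking suite (e.g. via excludeTags("quarantined")
    // in the build configuration) but still executed by a non-blocking quarantine job.
    @Tag("quarantined")
    @Test
    void completesCheckoutWithSavedCard() {
        // unchanged test body; only its gating changed while the fix is in progress
    }
}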
This policy forces a decision: fix the test or delete it. Both are better than a flaky test in a blocking suite. The tests that "we should fix someday" never get fixed. The tests in a quarantine queue with a deadline do.