Your Pipeline Is Flaky and That Is a Bigger Problem Than You Think

by Eric Hanson, Backend Developer at Clean Systems Consulting

The Red Build Nobody Investigates

Your pipeline fails. A developer looks at the job name and the stage, pulls up the logs, sees "connection refused" during Testcontainers startup, and clicks retry. Forty seconds later, it's green. They move on. This happened three times yesterday. Nobody filed a ticket.

This is the slow death of your CI system. Not through a catastrophic failure — through accumulated tolerance for failures that "don't count." The team has learned that some red builds are real (code problem) and some are noise (environment problem), and they've learned to distinguish between them by feel rather than by pipeline reliability. The moment that distinction becomes learned behavior, your pipeline has stopped being a reliable safety net.

Why Flakiness Is a Trust Problem, Not a Time Problem

The obvious cost of a flaky test is the time spent on retries. A pipeline that runs 30 times a day with a 5% flake rate wastes 1.5 pipeline runs per day on retries — annoying but not catastrophic.

The hidden cost is that developers learn to discount red builds. Once the team accepts "that's probably just flakiness" as a valid response to a failure, every genuine failure has to compete with that assumption. How many times will a developer retry a real regression before concluding it's a flake? Once? Twice? The answer depends on how often retrying worked before — which is exactly what a high flake rate trains them to expect.

In high-flake environments, genuine regressions get merged. Not because developers are careless, but because the pipeline has taught them that red doesn't mean broken.

The Common Sources and Their Fixes

Time-dependent tests are the most common and most fixable. Any test that calls new Date(), System.currentTimeMillis(), or Instant.now() directly is potentially flaky if it asserts on timing or relies on a specific temporal state.

// Flaky: the result depends on wall-clock timing during the run
@Test
void shouldRejectExpiredToken() {
    // Token takes its expiry instant; this one expires 50 ms in the future
    Token token = new Token(Instant.now().plusMillis(50));
    assertTrue(token.isExpired()); // passes on a slow CI runner, fails on a fast machine
}

// Stable: inject a controllable clock
@Test
void shouldRejectExpiredToken() {
    Clock fixed = Clock.fixed(Instant.parse("2026-04-25T10:00:00Z"), ZoneOffset.UTC);
    Token token = new Token(Instant.parse("2026-04-25T09:59:54Z")); // expired six seconds before "now"
    assertTrue(token.isExpired(fixed.instant())); // "now" is pinned, so the result never varies
}

External service dependencies are the second largest source. Tests that hit real HTTP endpoints, real databases, or real message brokers are subject to network variability, service availability, and rate limiting. Mock external HTTP services with WireMock, use Testcontainers for databases (with properly configured startup waits), and use in-memory implementations (such as embedded Kafka) for message brokers in unit tests.

// Testcontainers with a real readiness guarantee
@Container
static PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:16")
    .withStartupTimeout(Duration.ofSeconds(60))
    // Don't just wait for the port — wait until Postgres logs readiness.
    // (The official postgres image defines no HEALTHCHECK, so Wait.forHealthcheck() would time out.)
    .waitingFor(Wait.forLogMessage(".*database system is ready to accept connections.*", 2));
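For the HTTP side, the same idea can be sketched with no dependencies at all using the JDK's built-in com.sun.net.httpserver.HttpServer as a deterministic in-process stub; WireMock layers a stubbing DSL and request verification on top of this pattern. The /status endpoint and class name here are invented for the sketch.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StubServerDemo {
    // Start an in-process stub, call it once, and return the status code
    static int fetchStatus() throws Exception {
        // Port 0: the OS picks a free port, so parallel CI jobs never collide
        HttpServer stub = HttpServer.create(new InetSocketAddress(0), 0);
        stub.createContext("/status", exchange -> {
            byte[] body = "{\"status\":\"ok\"}".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
        });
        stub.start();
        try {
            URI uri = URI.create("http://localhost:" + stub.getAddress().getPort() + "/status");
            HttpResponse<String> resp = HttpClient.newHttpClient()
                .send(HttpRequest.newBuilder(uri).build(), HttpResponse.BodyHandlers.ofString());
            return resp.statusCode(); // deterministic: no network, no external service, no rate limits
        } finally {
            stub.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fetchStatus()); // → 200
    }
}
```

The code under test only needs a base URL it can be pointed at; everything the stub returns is fixed, so the test cannot fail for environmental reasons.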

Shared mutable state between tests causes interference that depends on execution order. This is particularly common in Spring Boot integration tests that share an application context with mutable singletons or caches. Use @DirtiesContext to force context reload when state modification is unavoidable, or redesign tests to be state-independent.
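A dependency-free illustration of that failure mode — the class, cache, and key names are invented for the sketch. Two tests sharing a static cache produce order-dependent results:

```java
import java.util.HashMap;
import java.util.Map;

public class SharedStateDemo {
    // A shared mutable singleton — exactly the kind of state that leaks between tests
    static final Map<String, String> CACHE = new HashMap<>();

    static String lookup(String key) {
        return CACHE.computeIfAbsent(key, k -> "fresh:" + k);
    }

    public static void main(String[] args) {
        // "Test A" populates the cache as a side effect…
        CACHE.put("user:42", "stale");
        // …so "Test B" observes A's leftovers instead of a fresh value.
        // Run B first (or alone) and it would see "fresh:user:42" instead.
        System.out.println(lookup("user:42")); // → stale
    }
}
```

The flakiness only appears when the runner reorders or parallelizes tests, which is why it so often shows up in CI but never locally.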

Resource contention on CI runners is the fourth source: tests that bind to fixed ports, write to shared file paths, or allocate more memory than the runner has. Use random port allocation (server.port=0, or WebEnvironment.RANDOM_PORT in Spring Boot tests), per-test temp directories, and check runner memory specs against what your tests actually need.
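The port half of that is cheap to fix in plain Java: bind to port 0 and let the OS hand out a free ephemeral port — the same trick Spring Boot's random-port support relies on. Class and method names here are illustrative.

```java
import java.net.ServerSocket;

public class EphemeralPortDemo {
    // Ask the OS for any free port instead of hard-coding one
    static int freePort() throws Exception {
        try (ServerSocket socket = new ServerSocket(0)) { // 0 = "pick one for me"
            return socket.getLocalPort();
        }
    }

    public static void main(String[] args) throws Exception {
        // Two parallel CI jobs doing this will never collide on a hard-coded port
        System.out.println(freePort() > 0); // → true
    }
}
```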

Tracking Flakiness

You can't fix what you're not measuring. Set up flake tracking before optimizing:

# Simplified flake detection: runs that failed, then passed on retry
# Query your CI API for the last 30 days of runs
# Flag any run where: final_status == 'success' AND any_prior_attempt_status == 'failed'

flake_rate = flaky_runs / total_runs

# Per-test: if you export test results as JUnit XML, aggregate across runs
# A test that shows both PASS and FAIL in the last 100 runs is flaky

Several CI platforms have built-in test insights that surface flaky tests over time (CircleCI's Test Insights and Buildkite's Test Analytics, for example). If yours doesn't, export JUnit XML from your test runner and aggregate it externally.
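The per-test aggregation described above is only a few lines once the JUnit XML has been parsed into pass/fail outcomes per test name. This sketch assumes that parsing has already happened; the history data and class name are illustrative.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class FlakeDetector {
    // A test is flaky if it both passed and failed over the recorded runs
    static Set<String> flakyTests(Map<String, List<Boolean>> resultsByTest) {
        Set<String> flaky = new TreeSet<>();
        for (Map.Entry<String, List<Boolean>> e : resultsByTest.entrySet()) {
            if (e.getValue().contains(true) && e.getValue().contains(false)) {
                flaky.add(e.getKey());
            }
        }
        return flaky;
    }

    // Run-level rate: runs that failed at least once but went green on retry
    static double flakeRate(int flakyRuns, int totalRuns) {
        return totalRuns == 0 ? 0.0 : (double) flakyRuns / totalRuns;
    }

    public static void main(String[] args) {
        Map<String, List<Boolean>> history = Map.of(
            "shouldRejectExpiredToken", List.of(true, false, true, true),
            "shouldCreateUser",         List.of(true, true, true, true));
        System.out.println(flakyTests(history)); // → [shouldRejectExpiredToken]
        System.out.println(flakeRate(2, 40));    // → 0.05
    }
}
```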

Set a target — say, a flake rate under 1% across all pipeline runs — and treat exceeding it as a P2 incident. Not a someday cleanup task. An active incident with an owner and a resolution date.

The Policy That Accelerates Fixing

The most effective policy for eliminating flakiness is simple: any test that flakes twice in a week gets quarantined (moved to a non-blocking suite) within 24 hours, and gets fixed or deleted within two weeks. Quarantined means it still runs but doesn't block merging, so flakiness doesn't propagate into developer workflow while the fix is in progress.
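Mechanically, quarantine can be as simple as a JUnit 5 tag that the blocking build excludes — the tag name, test names, and build wiring below are assumptions to adapt to your own runner:

```java
import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.Test;

class CheckoutFlowTest {
    @Test
    @Tag("quarantined") // excluded from the merge gate, still runs in the non-blocking suite
    void shouldProcessPaymentUnderLoad() {
        // …the flaky test body, unchanged, while the fix is in progress
    }
}

// Blocking suite (Maven Surefire):     <excludedGroups>quarantined</excludedGroups>
// Non-blocking suite (Maven Surefire): <groups>quarantined</groups>
```

Because the tag is visible in the code and in CI history, the quarantine queue stays auditable: anything carrying the tag past its two-week deadline is an overdue incident, not an invisible skip.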

This policy forces a decision: fix the test or delete it. Both are better than a flaky test in a blocking suite. The tests that "we should fix someday" never get fixed. The tests in a quarantine queue with a deadline do.
