Your Pipeline Is Flaky and That Is a Bigger Problem Than You Think

by Arif Ikhsanudin, Backend Developer

The Red Build Nobody Investigates

Your pipeline fails. A developer looks at the job name and the stage, pulls up the logs, sees "connection refused" on a Testcontainers startup, and clicks retry. Forty seconds later, it's green. They move on. This happened three times yesterday. Nobody filed a ticket.

This is the slow death of your CI system. It doesn't come through a catastrophic failure but through accumulated tolerance for failures that "don't count." The team has learned that some red builds are real (code problem) and some are noise (environment problem), and they tell the two apart by feel rather than by trusting what the pipeline reports. Once that habit sets in, your pipeline has stopped being a reliable safety net.

Why Flakiness Is a Trust Problem, Not a Time Problem

The obvious cost of a flaky test is the time spent on retries. A pipeline that runs 30 times a day with a 5% flake rate wastes 1.5 pipeline runs per day on retries — annoying but not catastrophic.
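
The retry arithmetic above can be sketched in a few lines of Java. `RetryCost` and its method names are hypothetical; the 30 runs per day and the 5% rate are the figures from the text. The second method adds one assumption the text does not make: that a retry can itself flake and is repeated until the run goes green.

```java
// Hypothetical helper for illustration; not part of any CI tool.
final class RetryCost {

    /** Expected extra runs per day if every flaky failure is retried exactly once. */
    static double singleRetry(int runsPerDay, double flakeRate) {
        return runsPerDay * flakeRate;
    }

    /** Expected extra runs per day if retries can flake too and are repeated until green. */
    static double retryUntilGreen(int runsPerDay, double flakeRate) {
        // Attempts per run follow a geometric distribution: 1 / (1 - p) on average.
        return runsPerDay * (1.0 / (1.0 - flakeRate) - 1.0);
    }

    public static void main(String[] args) {
        System.out.println("single retry:      " + singleRetry(30, 0.05));
        System.out.println("retry until green: " + retryUntilGreen(30, 0.05));
    }
}
```

Either way the time cost stays small at this flake rate, which is why the time argument alone rarely motivates a fix.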

The hidden cost is that developers learn to discount red builds. Once the team accepts "that's probably just flakiness" as a valid response to a failure, every genuine failure has to compete with that assumption. How many times will a developer retry a real regression before concluding it's a flake? Once? Twice? The answer depends on how often retrying worked before — which is exactly what a high flake rate trains them to expect.

In high-flake environments, genuine regressions get merged. Not because developers are careless, but because the pipeline has taught them that red doesn't mean broken.

The Common Sources and Their Fixes

Time-dependent tests are the most common and most fixable. Any test that calls new Date(), System.currentTimeMillis(), or Instant.now() directly is potentially flaky if it asserts on timing or relies on a specific temporal state.

// Flaky: the result depends on the wall clock at the moment isExpired() runs
@Test
void shouldRejectExpiredToken() {
    Token token = new Token(Instant.now().minusSeconds(5));
    assertTrue(token.isExpired()); // reads the real clock internally, so timing near the expiry boundary decides the outcome
}

// Stable: inject a controllable clock (java.time.Clock, java.time.Instant, java.time.ZoneOffset)
@Test
void shouldRejectExpiredToken() {
    Clock fixed = Clock.fixed(Instant.parse("2026-04-25T10:00:00Z"), ZoneOffset.UTC);
    Token token = new Token(Instant.parse("2026-04-25T09:59:54Z"), fixed);
    assertTrue(token.isExpired()); // Token reads time only from the injected clock, so the result never varies
}
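
For reference, here is one possible shape for a `Token` that supports this kind of test. The article does not show the real class, so everything here is an assumption made for illustration: the first constructor argument is treated as the issue time, and the lifetime is a fixed 5 seconds.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;

// Hypothetical Token: issuedAt, TTL, and the constructor shapes are assumptions.
final class Token {
    static final Duration TTL = Duration.ofSeconds(5); // assumed lifetime

    private final Instant issuedAt;
    private final Clock clock;

    /** Test-friendly constructor: the caller decides what "now" means. */
    Token(Instant issuedAt, Clock clock) {
        this.issuedAt = issuedAt;
        this.clock = clock;
    }

    /** Production convenience constructor: uses the real system clock. */
    Token(Instant issuedAt) {
        this(issuedAt, Clock.systemUTC());
    }

    boolean isExpired() {
        // Time is read only through the injected clock, never via Instant.now().
        return clock.instant().isAfter(issuedAt.plus(TTL));
    }
}
```

With a fixed clock, even the exact expiry boundary becomes testable: a token issued exactly 5 seconds before the fixed instant is deterministically not yet expired under this sketch's rule.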

External service dependencies are the second largest source. Tests that hit real HTTP endpoints, real databases, or real message brokers are subject to network variability, service availability, and rate limiting. Stub external HTTP services with WireMock. Use Testcontainers for databases, with a wait strategy that confirms the database is actually ready before tests start. For message brokers, use an in-process substitute (like an embedded Kafka) in tests instead of a shared remote broker.

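// Testcontainers with an explicit readiness wait
// (org.testcontainers.containers.PostgreSQLContainer, org.testcontainers.containers.wait.strategy.Wait, java.time.Duration)
@Container
static PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:16")
    // The official postgres image defines no Docker HEALTHCHECK, so Wait.forHealthcheck() would time out.
    // Wait for the readiness log line instead; it appears twice because the server restarts after init.
    // Don't just wait for the port: wait for readiness.
    .waitingFor(Wait.forLogMessage(".*database system is ready to accept connections.*\\s", 2)
        .withStartupTimeout(Duration.ofSeconds(60)));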

Shared mutable state between tests causes interference that depends on execution order. This is particularly common in Spring Boot integration tests that share an application context with mutable singletons or caches. Use @DirtiesContext to force context reload when state modification is unavoidable, or redesign tests to be state-independent.
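
The order dependence is easy to reproduce without Spring. The sketch below is a plain-Java stand-in, not a Spring test: `RateCache` plays the role of a mutable singleton, and the two static methods play the role of two tests. All names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical cache standing in for a mutable singleton bean.
final class RateCache {
    private final Map<String, Integer> rates = new HashMap<>();

    void put(String currency, int rate) { rates.put(currency, rate); }

    boolean isEmpty() { return rates.isEmpty(); }
}

final class OrderDependenceDemo {

    // "Test" A mutates whatever cache it is handed.
    static boolean writesRate(RateCache cache) {
        cache.put("EUR", 108);
        return !cache.isEmpty();
    }

    // "Test" B silently assumes nobody has touched the cache yet.
    static boolean expectsEmptyCache(RateCache cache) {
        return cache.isEmpty();
    }

    public static void main(String[] args) {
        // Shared instance: B passes only if it happens to run before A.
        RateCache shared = new RateCache();
        boolean aShared = writesRate(shared);
        boolean bShared = expectsEmptyCache(shared);
        System.out.println("shared cache, A then B: " + aShared + ", " + bShared);

        // Fresh instance per test: execution order no longer matters.
        boolean aIsolated = writesRate(new RateCache());
        boolean bIsolated = expectsEmptyCache(new RateCache());
        System.out.println("fresh caches, A then B: " + aIsolated + ", " + bIsolated);
    }
}
```

With the shared cache, B fails whenever A runs first; with a fresh cache per test, both pass in any order. Giving each test its own state is the cheaper fix where it is possible, since reloading a context costs suite time.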

Resource contention on CI runners shows up in tests that bind to specific ports, write to specific file paths, or allocate more memory than the runner has available. Use random port allocation (server.port=0, or webEnvironment = RANDOM_PORT on @SpringBootTest), write to temp directories, and check runner memory specs against what your tests actually need.
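
Outside of Spring, the same two ideas need only the JDK. This is a minimal sketch with hypothetical names (`IsolatedResources`, `freePort`, `scratchDir`): binding to port 0 asks the operating system for any free port, and `Files.createTempDirectory` returns a uniquely named directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical test helper; names are illustrative only.
final class IsolatedResources {

    /** Asks the OS for a currently free port by binding to port 0 on loopback. */
    static int freePort() {
        try (ServerSocket socket = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            // Caveat: the port is released when the socket closes, so another
            // process could grab it before the test uses it. Where possible,
            // let the server under test bind to port 0 itself.
            return socket.getLocalPort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a uniquely named scratch directory under the system temp dir. */
    static Path scratchDir() {
        try {
            return Files.createTempDirectory("ci-test-");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("port: " + freePort());
        System.out.println("dir:  " + scratchDir());
    }
}
```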

Tracking Flakiness

You can't fix what you're not measuring. Set up flake tracking before optimizing:

# Simplified flake detection: runs that failed, then passed on retry with no code change
# Query your CI API for the last 30 days of runs
# Flag any run where: final_status == 'success' AND any_prior_attempt_status == 'failed'

flake_rate = flaky_runs / total_runs

# Per-test: if you export test results as JUnit XML, aggregate across runs
# A test that shows both PASS and FAIL for the same commit is flaky
# (PASS and FAIL on different commits may just be a regression and its fix)

Several CI platforms (CircleCI and Buildkite, for example) have built-in test insights that show flaky tests over time. If yours doesn't, export JUnit XML from your test runner and aggregate it externally.
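
If you do aggregate externally, the core logic is small. The sketch below assumes the JUnit XML has already been parsed into simple records; `FlakeReport`, `Result`, and the field names are hypothetical and do not correspond to any CI platform's API. It treats a test as flaky only when it both passed and failed on the same commit, so that a regression followed by its fix is not miscounted.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class FlakeReport {

    /** One test outcome from one CI run (hypothetical shape). */
    record Result(String test, String commit, boolean passed) {}

    /** Names of tests that both passed and failed on at least one commit. */
    static Set<String> flakyTests(List<Result> results) {
        // test name -> commit -> distinct outcomes seen for that pair
        Map<String, Map<String, Set<Boolean>>> seen = new HashMap<>();
        for (Result r : results) {
            seen.computeIfAbsent(r.test(), t -> new HashMap<>())
                .computeIfAbsent(r.commit(), c -> new HashSet<>())
                .add(r.passed());
        }
        Set<String> flaky = new TreeSet<>();
        seen.forEach((test, byCommit) -> {
            boolean mixed = byCommit.values().stream().anyMatch(o -> o.size() == 2);
            if (mixed) {
                flaky.add(test);
            }
        });
        return flaky;
    }

    /** Run-level flake rate: flaky runs divided by total runs (0 if there are no runs). */
    static double flakeRate(int flakyRuns, int totalRuns) {
        return totalRuns == 0 ? 0.0 : (double) flakyRuns / totalRuns;
    }

    public static void main(String[] args) {
        List<Result> results = List.of(
            new Result("loginTest", "abc123", false),
            new Result("loginTest", "abc123", true),   // same commit, different outcome
            new Result("cartTest", "abc123", false),
            new Result("cartTest", "def456", true));   // different commits: likely a real fix
        System.out.println(flakyTests(results));
        System.out.println(flakeRate(3, 300));
    }
}
```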

Set a target, such as a flake rate of at most 1% across all pipeline runs, and treat exceeding it as a P2 incident. Not a someday cleanup task. An active incident with an owner and a resolution date.

The Policy That Accelerates Fixing

The most effective policy for eliminating flakiness is simple: any test that flakes twice in a week gets quarantined (moved to a non-blocking suite) within 24 hours, and gets fixed or deleted within two weeks. Quarantined means it still runs but doesn't block merging, so flakiness doesn't propagate into developer workflow while the fix is in progress.

This policy forces a decision: fix the test or delete it. Both are better than a flaky test in a blocking suite. The tests that "we should fix someday" never get fixed. The tests in a quarantine queue with a deadline do.
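
The policy is mechanical enough to automate. Here is a minimal sketch of the decision rule, assuming flake events are available as timestamps per test. `QuarantinePolicy` and its method names are hypothetical, and "twice in a week" is interpreted as a trailing 7-day window, which is one reading among several.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical policy helper; thresholds mirror the numbers in the text.
final class QuarantinePolicy {
    static final int FLAKE_THRESHOLD = 2;                     // flakes that trigger quarantine
    static final Duration WINDOW = Duration.ofDays(7);        // assumed trailing window
    static final Duration FIX_DEADLINE = Duration.ofDays(14); // fix or delete within two weeks

    /** True if at least FLAKE_THRESHOLD flakes fall within the 7 days up to and including now. */
    static boolean shouldQuarantine(List<Instant> flakeTimes, Instant now) {
        Instant windowStart = now.minus(WINDOW);
        long recent = flakeTimes.stream()
            .filter(t -> !t.isBefore(windowStart) && !t.isAfter(now))
            .count();
        return recent >= FLAKE_THRESHOLD;
    }

    /** Date by which a test quarantined at the given instant must be fixed or deleted. */
    static Instant fixDeadline(Instant quarantinedAt) {
        return quarantinedAt.plus(FIX_DEADLINE);
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2026-04-25T10:00:00Z");
        List<Instant> flakes = List.of(
            Instant.parse("2026-04-20T08:00:00Z"),
            Instant.parse("2026-04-24T16:30:00Z"));
        System.out.println("quarantine: " + shouldQuarantine(flakes, now));
        System.out.println("deadline:   " + fixDeadline(now));
    }
}
```

A scheduled job that runs this check and opens a ticket with the deadline attached keeps the policy from depending on anyone remembering it.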
