The Pipeline Step Nobody Wants to Optimize Until It Hurts
by Eric Hanson, Backend Developer at Clean Systems Consulting
The Step Everyone Treats as Solved
Your pipeline runs tests, builds an image, maybe scans it for vulnerabilities. Then, somewhere in the deployment process, there's a migration step. It runs flyway migrate or liquibase update against the target database, and if it works, the app starts. If it doesn't, you're debugging at 2pm on a Thursday while traffic is routing to a service that can't start.
Migration handling is the step most teams don't examine because it "just works" right up until it doesn't. When it fails, the failure modes are severe: a blocking lock on a large table, a migration that ran partially before timing out, a schema change that's incompatible with the currently deployed application version, or a migration that succeeded in staging (MySQL 8.0.28 with a 30-second lock timeout) but times out in production (MySQL 8.0.32 with different timeout settings).
Why Migrations Are Structurally Different From Other Pipeline Steps
Most pipeline steps are idempotent and isolated. Running unit tests twice doesn't change anything. Building the Docker image twice produces the same artifact. Migrations are neither: they mutate shared state (the database schema), they're often not safely re-runnable, and their effects are visible immediately to any currently-running instance of the application.
This means migration failures don't just fail the deployment — they can leave the database in a state where neither the old nor new version of the application can run correctly. That is a production incident, not a failed pipeline run.
What Actually Goes Wrong
Non-backward-compatible changes. Adding a NOT NULL column without a default causes existing application instances (if you're doing a rolling deploy) to fail when inserting rows that don't include the new column. The migration succeeds; the currently-running pods start failing.
-- Dangerous: breaks running instances during rolling deploy
ALTER TABLE payments ADD COLUMN reference_id VARCHAR(64) NOT NULL;
-- Safe: add nullable first, backfill, then add constraint in next release
ALTER TABLE payments ADD COLUMN reference_id VARCHAR(64) NULL;
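For completeness, here's a sketch of those later steps, assuming PostgreSQL syntax and a known backfill value (in practice you'd run the UPDATE in batches to avoid holding row locks across 50 million rows):
-- Next release, once all running code writes reference_id:
-- backfill existing rows, then enforce the constraint
UPDATE payments SET reference_id = '' WHERE reference_id IS NULL;  -- '' is a placeholder value
ALTER TABLE payments ALTER COLUMN reference_id SET NOT NULL;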
Lock acquisition on large tables. A simple ALTER TABLE on a table with 50 million rows needs an exclusive metadata lock. In MySQL (InnoDB), the DDL can wait indefinitely behind open transactions that still touch the table, and while it waits, every subsequent query against that table queues behind it. In PostgreSQL, ADD COLUMN with a constant DEFAULT has been a metadata-only change since version 11, but it still takes a brief ACCESS EXCLUSIVE lock, and adding NOT NULL to an existing column still scans the whole table under that lock.
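A defensive habit that helps here is capping how long the DDL will wait for its lock, so a blocked migration fails fast instead of stalling every query queued behind it. A sketch; both settings are standard server variables, and the five-second value is an arbitrary choice:
-- PostgreSQL: give up on the lock after 5 seconds instead of queueing
SET lock_timeout = '5s';
ALTER TABLE payments ADD COLUMN reference_id VARCHAR(64);

-- MySQL 8.0: same idea, and request a non-blocking online DDL explicitly
SET SESSION lock_wait_timeout = 5;
ALTER TABLE payments ADD COLUMN reference_id VARCHAR(64), ALGORITHM=INPLACE, LOCK=NONE;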
Timeouts that leave partial state. If your migration runner has a 30-second timeout and a migration takes 35 seconds in production, you get a partial migration (depending on whether the statement was atomic) or a failed migration that Flyway records as failed in its schema history table, blocking all future migrations until the entry is cleaned up.
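Flyway's own CLI has the recovery path for that state. A sketch of the sequence (flyway info and flyway repair are standard commands; the connection flags mirror the deploy script later in this article):
# Inspect the schema history to find the failed version
flyway -url="$DB_URL" -user="$DB_USER" -password="$DB_PASSWORD" info
# After manually rolling back any partially applied DDL,
# clear the failed entry so future migrations can run
flyway -url="$DB_URL" -user="$DB_USER" -password="$DB_PASSWORD" repair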
Migration Testing in the Pipeline
The place to catch these issues is CI, not production. But most pipelines run migrations against a fresh empty database on every run, which means they never catch migration problems that only appear at scale or against real data distributions.
A more useful approach:
# Integration test job: run migrations against a snapshot of production schema
integration-tests:
  runs-on: ubuntu-latest
  services:
    postgres:
      image: postgres:16
      env:
        POSTGRES_DB: testdb
        POSTGRES_PASSWORD: testpass
      ports:
        - 5432:5432
      options: --health-cmd pg_isready --health-interval 5s --health-timeout 5s --health-retries 5
  env:
    DATABASE_URL: postgres://postgres:testpass@localhost:5432/testdb
  steps:
    - uses: actions/checkout@v4
    # Restore a schema dump (not data) from production
    - name: Restore baseline schema
      run: psql "$DATABASE_URL" < ./db/baseline-schema.sql
    # Run pending migrations against the baseline
    - name: Run migrations
      run: ./gradlew flywayMigrate
    # Then run application tests
    - name: Run integration tests
      run: ./gradlew integrationTest
The baseline schema should be a recent dump of the production schema structure (not data), updated monthly or when significant schema changes land. This gives you migration tests that run against a realistic starting point rather than an empty database.
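One way to produce that baseline is pg_dump's --schema-only flag, which dumps DDL without rows. A sketch, assuming you can reach a production replica; PROD_DATABASE_URL is a placeholder, and the output path matches the CI job above:
# Refresh the baseline schema from a production replica (structure only, no data)
pg_dump --schema-only --no-owner --no-privileges "$PROD_DATABASE_URL" > db/baseline-schema.sql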
Separating Migration from Deployment
The most resilient pattern is running migrations separately from application deployment, with an explicit validation step between:
- Pre-deployment migration: run the migration before deploying new application code
- Validation: confirm the database is in the expected state (see the sketch after this list)
- Application deployment: deploy new code that is compatible with both old and new schema
- Post-deployment cleanup (next release): remove backward-compatibility shims
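The validation step can be lighter than it sounds: flyway validate checks that applied migrations match the files on disk, and a sanity query confirms the shape you expect. A sketch, reusing the column from the earlier example:
-- Works in both MySQL and PostgreSQL via information_schema
SELECT is_nullable
FROM information_schema.columns
WHERE table_name = 'payments' AND column_name = 'reference_id';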
This requires that every migration be backward-compatible with the current application version. It's more design work per migration, but it eliminates the class of incidents where a bad migration causes the entire fleet of application instances to fail simultaneously.
#!/bin/bash
# deploy.sh: migrations first, deploy second, validate between
set -euo pipefail  # abort immediately if any step fails

echo "Running database migrations..."
if ! flyway -url="$DB_URL" -user="$DB_USER" -password="$DB_PASSWORD" migrate; then
    echo "Migration failed. Aborting deployment."
    exit 1
fi

echo "Validating schema..."
flyway -url="$DB_URL" -user="$DB_USER" -password="$DB_PASSWORD" validate

echo "Deploying application..."
kubectl set image deployment/myapp myapp="$IMAGE_TAG"
kubectl rollout status deployment/myapp --timeout=5m
The step nobody wants to optimize carries an outsized share of your deployment incident risk. Treat it with the care it deserves.