Testing Your Docker Setup Before It Hits Production
by Arif Ikhsanudin, Backend Developer
The configuration bug that only appears in production
Your Docker Compose file looks correct. Your Dockerfile builds cleanly. CI is green. The image deploys to staging, starts, passes the basic health check, and gets promoted to production. Six hours later, the on-call engineer gets paged: the application is running, but it is trying to write to a path that isn't writable because the Kubernetes pod runs with a read-only root filesystem, and its error logs have filled a tmpfs mount that is now out of space.
This wasn't a code bug. It was a configuration assumption (that the filesystem was writable) that nobody tested before the image reached a restricted environment. A fifteen-minute local validation step would have caught it.
Here's the validation process that surfaces these issues before deployment.
Layer 1: validate the image builds and runs
The baseline: the image builds, starts, and the process doesn't immediately exit.
# Build the image
docker build -t myapp:test .
# Start it and check it's running after 5 seconds
docker run -d --name myapp-test myapp:test
sleep 5
if ! docker ps | grep -q myapp-test; then
  echo "Container exited immediately"
  docker logs myapp-test
  exit 1
fi
# Clean up
docker rm -f myapp-test
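One gap in the snippet above: the failure path exits before docker rm -f runs, so a failed check leaves the container behind and the next run hits a name conflict. Here is a minimal sketch of a wrapper that cleans up either way. The helper name with_container is hypothetical (it is not a Docker command), and it assumes the check can be expressed as a single command.

```shell
#!/usr/bin/env bash
# with_container NAME CHECK_COMMAND...
# Hypothetical helper: runs the check, prints the container logs if it fails,
# and removes the container either way. Returns the check's exit status.
with_container() {
  local name="$1" status=0
  shift
  "$@" || status=$?
  if [ "$status" -ne 0 ]; then
    echo "Check failed for ${name} (exit ${status})" >&2
    docker logs "$name"
  fi
  docker rm -f "$name" >/dev/null
  return "$status"
}

# Usage, with the names from the example above:
#   docker run -d --name myapp-test myapp:test
#   with_container myapp-test curl -sf http://localhost:8080/health
```

The same wrapper can be reused for the constrained and load-test containers in the later layers.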
Add a health check endpoint test:
docker run -d --name myapp-test -p 8080:8080 myapp:test
sleep 10
if ! curl -sf http://localhost:8080/health; then
  echo "Health check failed"
  docker logs myapp-test
  exit 1
fi
docker rm -f myapp-test
This catches: crashes on startup, missing environment variables (if you validate at startup), JVM OOM errors, configuration loading failures.
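A fixed sleep is either longer than necessary or too short on a slow CI runner. Here is a minimal sketch of polling the health endpoint instead. The helper name wait_for_http and its default timeout and interval are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# wait_for_http URL [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Polls URL until curl gets a successful response or the timeout elapses.
# Returns 0 on success, 1 on timeout. Defaults (30s, 1s) are illustrative.
wait_for_http() {
  local url="$1" timeout="${2:-30}" interval="${3:-1}" elapsed=0
  until curl -sf -o /dev/null "$url"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "Timed out after ${timeout}s waiting for ${url}" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# Usage, replacing "sleep 10" in the example above:
#   wait_for_http http://localhost:8080/health 30 || { docker logs myapp-test; exit 1; }
```

The check then finishes as soon as the service is ready and still fails within a bounded time when it never comes up.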
Layer 2: test with production-like constraints
Run the container with the same security and resource constraints it'll have in production:
docker run -d \
  --name myapp-constrained \
  --read-only \
  --tmpfs /tmp:size=64m \
  --memory=512m \
  --cpus=0.5 \
  --user 1001:1001 \
  --cap-drop ALL \
  --security-opt no-new-privileges:true \
  -p 8080:8080 \
  myapp:test
sleep 15
# Verify it's still running
if ! docker ps | grep -q myapp-constrained; then
  echo "Container failed under production constraints"
  docker logs myapp-constrained
  exit 1
fi
# Verify health endpoint
curl -sf http://localhost:8080/health || { docker logs myapp-constrained; exit 1; }
docker rm -f myapp-constrained
This catches: filesystem writes to paths that aren't backed by a volume or tmpfs (they fail under --read-only), insufficient memory (OOM kills under the 512 MB limit), permission errors (files the non-root user 1001 can't read or write), and startup operations that require dropped capabilities.
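When the constrained container does die, it helps to know which constraint killed it. Here is a minimal sketch that labels the failure from the exit code and the OOMKilled flag reported by docker inspect. The function names and label strings are hypothetical; the exit-code meanings (126, 127, and 137 as 128 + SIGKILL) follow the usual shell conventions.

```shell
#!/usr/bin/env bash
# classify_exit EXIT_CODE OOM_KILLED
# Maps a container's exit code and OOMKilled flag to a short label.
# 126 and 127 follow shell convention; 137 is 128 + 9 (SIGKILL).
classify_exit() {
  local code="$1" oom="$2"
  if [ "$oom" = "true" ]; then
    echo "oom-killed"
  elif [ "$code" -eq 0 ]; then
    echo "clean-exit"
  elif [ "$code" -eq 126 ]; then
    echo "not-executable-or-permission-denied"
  elif [ "$code" -eq 127 ]; then
    echo "command-not-found"
  elif [ "$code" -eq 137 ]; then
    echo "killed-sigkill"
  else
    echo "app-error-${code}"
  fi
}

# diagnose_container NAME
# Reads both fields from docker inspect; call it before docker rm -f.
diagnose_container() {
  local name="$1" code oom
  code=$(docker inspect --format '{{.State.ExitCode}}' "$name") || return 1
  oom=$(docker inspect --format '{{.State.OOMKilled}}' "$name") || return 1
  echo "${name}: $(classify_exit "$code" "$oom")"
}
```

Read-only filesystem failures usually surface as an ordinary application error, so for the app-error label the container logs (look for "Read-only file system") remain the place to check.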
For services that legitimately need to write to specific paths, add the volumes:
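# ... other flags go in the list below, one per line.
# A comment placed between the continued lines would end the command early.
docker run -d \
  --read-only \
  --tmpfs /tmp:size=64m \
  --mount type=volume,source=test_logs,target=/app/logs \
  myapp:test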
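To find out which paths need a volume or tmpfs in the first place, one option is to run the container once without --read-only and without mounts, exercise it, and inspect the output of docker diff, which lists added (A), changed (C), and deleted (D) paths in the container's writable layer. The sketch below filters that output down to the files written outside the paths you already plan to mount. The function name unexpected_writes is hypothetical.

```shell
#!/usr/bin/env bash
# unexpected_writes ALLOWED_PREFIX...
# Reads docker diff output on stdin (lines such as "A /path" or "C /path") and
# prints the changed paths that fall outside the allowed prefixes.
# Parent directories of other changed paths are skipped as noise.
unexpected_writes() {
  local -a paths=()
  local kind path other prefix is_leaf allowed
  while read -r kind path; do
    [ -n "$path" ] && paths+=("$path")
  done
  for path in "${paths[@]}"; do
    is_leaf=1
    for other in "${paths[@]}"; do
      # A path with changed children is only a container for the real write
      if [[ "$other" == "$path"/* ]]; then is_leaf=0; break; fi
    done
    [ "$is_leaf" -eq 1 ] || continue
    allowed=0
    for prefix in "$@"; do
      if [[ "$path" == "$prefix" || "$path" == "$prefix"/* ]]; then allowed=1; break; fi
    done
    [ "$allowed" -eq 1 ] || echo "$path"
  done
}

# Usage, with the names from the examples above:
#   docker run -d --name myapp-test -p 8080:8080 myapp:test
#   ... exercise the service ...
#   docker diff myapp-test | unexpected_writes /tmp /app/logs
```

Any path it prints is a write that would fail under --read-only with the mounts shown above. Paths under an existing volume or tmpfs never appear in docker diff, which is why the discovery run leaves the mounts off.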
Layer 3: test the full Compose stack
Integration test the Compose configuration — not just the image in isolation, but the entire stack with networking, dependency ordering, and environment variable injection.
# Start the full stack
docker compose up -d
# Wait for all services to be healthy (with timeout)
TIMEOUT=120
ELAPSED=0
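# Count services whose status is not "(healthy)". Matching the parentheses keeps
# "(unhealthy)" from passing; every service needs a healthcheck for this to reach 0.
until [ "$(docker compose ps | grep -v "NAME" | grep -vc "(healthy)")" -eq 0 ]; do
  if [ "$ELAPSED" -ge "$TIMEOUT" ]; then
    echo "Services did not become healthy within ${TIMEOUT}s"
    docker compose ps
    docker compose logs
    exit 1
  fi
  sleep 5
  ELAPSED=$((ELAPSED + 5))
done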
echo "All services healthy"
# Run a smoke test against the running stack
curl -sf http://localhost:8080/health || exit 1
curl -sf http://localhost:8080/api/ping || exit 1
# Clean up
docker compose down -v
This catches: dependency ordering issues (depends_on not waiting correctly), network configuration problems (service can't reach database), environment variable configuration errors, volume mount permission issues in the Compose context.
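Recent Docker Compose v2 releases can do the waiting themselves: docker compose up accepts --wait (block until services are running or healthy) and --wait-timeout. Here is a minimal sketch built on those flags, assuming your Compose version supports both. The function name run_stack_test and the STACK_TIMEOUT variable are hypothetical.

```shell
#!/usr/bin/env bash
# run_stack_test SMOKE_COMMAND...
# Starts the Compose stack, waits for it to become healthy, runs the smoke
# command, and always tears the stack down. Returns 1 on any failure.
# Assumes a Compose v2 release that supports --wait and --wait-timeout.
run_stack_test() {
  local timeout="${STACK_TIMEOUT:-120}" status=0
  if ! docker compose up -d --wait --wait-timeout "$timeout"; then
    echo "Stack did not become healthy within ${timeout}s" >&2
    docker compose ps
    docker compose logs
    status=1
  elif ! "$@"; then
    echo "Smoke test failed: $*" >&2
    docker compose logs
    status=1
  fi
  docker compose down -v
  return "$status"
}

# Usage:
#   run_stack_test curl -sf http://localhost:8080/health
```

Because the teardown runs on every path, a failed run does not leave volumes or networks behind for the next one.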
Layer 4: production environment parity test
Write a script that verifies specific production constraints are met. Run it in CI:
#!/bin/bash
# validate-docker.sh
# Usage: validate-docker.sh IMAGE:TAG (the full image reference)
IMAGE="$1"
ERRORS=0
# Check: runs as non-root
# Stored in IMAGE_UID so the shell's own USER variable is not overwritten
IMAGE_UID=$(docker run --rm --entrypoint id "$IMAGE" -u)
if [ "$IMAGE_UID" = "0" ]; then
  echo "FAIL: Container runs as root (UID 0)"
  ERRORS=$((ERRORS + 1))
else
  echo "PASS: Container runs as UID $IMAGE_UID"
fi
# Check: no hardcoded secrets in environment
# println puts each variable on its own line; a \n in the template text is printed literally
ENV_VARS=$(docker inspect "$IMAGE" --format '{{range .Config.Env}}{{println .}}{{end}}')
SECRETS_FOUND=0
for secret_keyword in PASSWORD SECRET KEY TOKEN; do
  # Match the keyword anywhere in the variable name (DB_PASSWORD, API_KEY, ...)
  if echo "$ENV_VARS" | grep -qiE "^[^=]*${secret_keyword}[^=]*=.+"; then
    echo "FAIL: Possible hardcoded secret matching ${secret_keyword} found in image ENV"
    ERRORS=$((ERRORS + 1))
    SECRETS_FOUND=1
  fi
done
if [ "$SECRETS_FOUND" -eq 0 ]; then
  echo "PASS: No obvious hardcoded secrets in image ENV"
fi
# Check: no .git directory in image
# --entrypoint sh makes the check work even when the image defines its own ENTRYPOINT
if docker run --rm --entrypoint sh "$IMAGE" -c 'find / -name ".git" -type d 2>/dev/null | grep -q .'; then
  echo "FAIL: .git directory found in image"
  ERRORS=$((ERRORS + 1))
else
  echo "PASS: No .git directory in image"
fi
# Check: no .env files in image
if docker run --rm --entrypoint sh "$IMAGE" -c 'find / -name ".env" 2>/dev/null | grep -q .'; then
  echo "FAIL: .env file found in image"
  ERRORS=$((ERRORS + 1))
else
  echo "PASS: No .env files in image"
fi
# Check: HEALTHCHECK is defined
if docker inspect "$IMAGE" --format '{{.Config.Healthcheck}}' | grep -q '<nil>'; then
  echo "WARN: No HEALTHCHECK defined in image"
fi
echo ""
if [ "$ERRORS" -gt 0 ]; then
  echo "Validation failed with $ERRORS error(s)"
  exit 1
else
  echo "All validation checks passed"
fi
Run this in CI after building the image:
- name: Validate image
  run: ./scripts/validate-docker.sh myapp:${{ github.sha }}
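The non-root check in the script runs id inside the container, which fails on minimal images that ship no shell or coreutils. A possible alternative is to read the USER recorded in the image config without starting a container. The sketch below does that; the function names are hypothetical, and it assumes that a named user other than root does not map to UID 0 inside the image.

```shell
#!/usr/bin/env bash
# is_root_user_spec SPEC
# Succeeds when an image USER value ("", "0", "root", "0:0", "root:root") means root.
is_root_user_spec() {
  local user="${1%%:*}"   # drop the optional ":group" part
  case "$user" in
    ""|0|root) return 0 ;;
    *) return 1 ;;
  esac
}

# check_image_user IMAGE
# Reads .Config.User from the image metadata; no container is started.
check_image_user() {
  local image="$1" spec
  spec=$(docker inspect --format '{{.Config.User}}' "$image") || return 2
  if is_root_user_spec "$spec"; then
    echo "FAIL: image USER is '${spec:-unset}' (root)"
    return 1
  fi
  echo "PASS: image USER is '$spec'"
}

# Usage:
#   check_image_user "myapp:test" || ERRORS=$((ERRORS + 1))
```

An empty USER counts as root here because Docker runs the container as root when the image does not set one.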
Layer 5: load test under container constraints
For services where performance under resource limits matters, run a brief load test against the containerized service:
docker run -d \
  --name myapp-load \
  --memory=512m \
  --cpus=0.5 \
  -p 8080:8080 \
  myapp:test
sleep 15
# k6, hey, or ab for load testing
# k6 run --vus 10 --duration 30s load-test.js
hey -n 1000 -c 10 http://localhost:8080/api/endpoint
# Check container didn't OOM
if docker inspect myapp-load --format '{{.State.OOMKilled}}' | grep -q true; then
  echo "FAIL: Container was OOM killed during load test"
  exit 1
fi
docker rm -f myapp-load
hey (github.com/rakyll/hey) is a simple HTTP load generator. k6 (k6.io) is more capable for complex scenarios. Either can tell you quickly whether your service survives its production resource allocation under realistic load.
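Surviving without an OOM kill says nothing about how close the service came to the limit. Here is a minimal sketch that samples memory usage through docker stats and warns above a threshold. The function names and the 90% default are assumptions, and the value is a point-in-time sample rather than a peak, so it is most useful when run while the load generator is still going.

```shell
#!/usr/bin/env bash
# mem_percent_exceeds PERCENT_STRING THRESHOLD
# Succeeds when a value such as "45.32%" is at or above the integer threshold.
mem_percent_exceeds() {
  local raw="${1%\%}" threshold="$2"
  local whole="${raw%%.*}"   # integer part is enough for a threshold check
  [ "${whole:-0}" -ge "$threshold" ]
}

# check_memory_headroom CONTAINER [THRESHOLD_PERCENT]
# Takes one sample of the container's memory usage relative to its limit.
check_memory_headroom() {
  local name="$1" threshold="${2:-90}" pct
  pct=$(docker stats --no-stream --format '{{.MemPerc}}' "$name") || return 2
  if mem_percent_exceeds "$pct" "$threshold"; then
    echo "WARN: $name is at $pct of its memory limit (threshold ${threshold}%)"
    return 1
  fi
  echo "PASS: $name is at $pct of its memory limit"
}

# Usage, in a second terminal or background job during the load test:
#   check_memory_headroom myapp-load 90
```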
Integrating into CI
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build image
        run: docker build -t myapp:${{ github.sha }} .
      - name: Validate image
        run: ./scripts/validate-docker.sh myapp:${{ github.sha }}
      - name: Integration test
        run: |
          docker compose up -d
          ./scripts/wait-healthy.sh
          ./scripts/smoke-test.sh
          docker compose down -v
      - name: Constraint test
        run: |
          docker run -d --name test \
            --read-only --tmpfs /tmp --memory=512m --user 1001 \
            -p 8080:8080 myapp:${{ github.sha }}
          sleep 15
          curl -sf http://localhost:8080/health
          docker rm -f test
Each layer catches a different class of problem. Together they close the gap between "works in development" and "fails in production configuration." The investment is one afternoon to write the scripts — the return is production incidents that don't happen.