Why Your Docker Image Works Locally But Breaks in Production
by Eric Hanson, Backend Developer at Clean Systems Consulting
The container that passes CI and fails in ECS
Your service works in local Docker Compose. It builds clean in CI. The image gets pushed to ECR. It deploys to ECS Fargate, and the health check fails. The logs show the application started but the health endpoint never responds. You spend two hours ssh-ing into nothing (Fargate doesn't have ssh), reading CloudWatch logs, and eventually discover the issue is a read-only filesystem mount the app was trying to write to.
This is the category of Docker problem that doesn't show up until production: the image is fine, the Dockerfile is fine, but the environment the image runs in is different in ways you didn't account for.
Here's a map of the most common mismatches, and how to close them.
Architecture: the ARM/AMD64 gap
If you develop on an Apple M-series Mac, you build ARM64 images by default. If production runs on x86-64 (most cloud instances, most CI runners), the image you built locally won't run there, typically dying with an exec format error; or, where emulation happens to be available, it will run slowly and can behave differently than it does on your machine.
Verify your image architecture:
docker inspect your-image:tag | grep Architecture
Build for the production platform explicitly:
docker build --platform linux/amd64 -t your-image:tag .
Or use docker buildx for multi-platform builds that produce manifests supporting both:
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t your-registry/your-image:tag \
  --push .
In CI, always set --platform linux/amd64 (or whatever your production target is) explicitly. Don't let the runner's native architecture determine the output architecture.
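One way to enforce that is a small guard in the CI build script. This is a sketch; the variable names are illustrative rather than tied to any particular CI system:
TARGET_ARCH=amd64
docker build --platform "linux/${TARGET_ARCH}" -t your-image:tag .
BUILT_ARCH=$(docker inspect --format '{{.Architecture}}' your-image:tag)
# Fail the job if the image was built for the wrong architecture
if [ "$BUILT_ARCH" != "$TARGET_ARCH" ]; then
  echo "Built for ${BUILT_ARCH}, expected ${TARGET_ARCH}" >&2
  exit 1
fi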
File permissions and user mismatch
Locally, Docker often runs as root or with a user that matches your laptop's UID. In production environments — Kubernetes with runAsNonRoot: true, ECS task definitions with a user field, Fargate with restricted execution — the container user may differ.
If your application writes to a directory inside the container that was created by root during the build, a non-root runtime user will get permission denied errors.
# Copy application files owned by the non-root node user, then run as that user
FROM node:20-alpine
WORKDIR /app
COPY --chown=node:node . .
USER node
The --chown flag on COPY sets ownership at copy time. Do this for all COPY instructions when you intend to run as non-root. Also:
RUN mkdir -p /app/logs /app/tmp \
    && chown -R node:node /app/logs /app/tmp
Create any directories your application writes to during build, set ownership explicitly, then switch to the non-root user.
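You can surface ownership mistakes locally by forcing an arbitrary non-root user at run time, which is roughly what a restricted production environment will do (1001:1001 is just an example UID):
# Run as a non-root user the image knows nothing about; permission
# errors on root-owned directories show up immediately
docker run --rm --user 1001:1001 your-image:tag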
Environment variables: present locally, absent in production
In local development, environment variables come from a .env file loaded by Docker Compose or a local shell profile. In production, they come from Kubernetes secrets, ECS task definition environment fields, or a secrets manager at startup.
The failure mode: a variable is set in your local .env but missing from the production environment config. The application starts, reaches the code path that uses the variable, and either crashes or behaves unexpectedly.
Fail fast at startup for required variables:
// Node.js
const required = ['DATABASE_URL', 'JWT_SECRET', 'PORT'];
for (const key of required) {
  if (!process.env[key]) {
    console.error(`Missing required environment variable: ${key}`);
    process.exit(1);
  }
}
// Spring Boot — fail fast with @Value
@Value("${database.url:#{null}}")
private String databaseUrl;
@PostConstruct
public void validate() {
if (databaseUrl == null) {
throw new IllegalStateException("database.url must be configured");
}
}
An application that crashes at startup with a clear error message (Missing required environment variable: DATABASE_URL) is infinitely easier to diagnose than one that starts, fails silently, and reports a 500 response three requests later.
Resource limits: unlimited locally, constrained in production
Local Docker runs don't have memory or CPU limits unless you explicitly set them. Production environments almost always do — Kubernetes resource limits, ECS task definition memory, Fargate task size.
The failure mode: your JVM application uses up to 4GB of heap locally, but in production the container is limited to 1GB. The JVM sizes its default heap from the memory it believes is available, and in older JVMs (before JDK 10, or 8u191 on the JDK 8 line) that meant the host's total RAM rather than the container's limit, so the heap grew past the limit and the container was OOMKilled.
For JVM applications in containers, always set explicit heap options:
ENV JAVA_OPTS="-XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0 -XX:InitialRAMPercentage=50.0"
ENTRYPOINT ["sh", "-c", "java $JAVA_OPTS -jar app.jar"]
-XX:+UseContainerSupport (default since JDK 8u191) makes the JVM respect container memory limits. -XX:MaxRAMPercentage=75.0 sets heap to 75% of the container's memory limit, leaving headroom for the JVM's off-heap memory and the OS.
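If you want to confirm what heap the JVM will actually choose under a given limit, you can ask it directly; eclipse-temurin:21-jre here is just an example JRE image:
# Print the computed max heap inside a 512MB container
docker run --rm --memory=512m eclipse-temurin:21-jre \
  java -XX:MaxRAMPercentage=75.0 -XX:+PrintFlagsFinal -version | grep -i maxheapsize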
Test locally with the same limits as production:
docker run --memory=512m --cpus=0.5 your-image:tag
If the app fails under these constraints, you want to know before production does.
Filesystem: writable locally, read-only in production
Kubernetes securityContext.readOnlyRootFilesystem: true mounts the container root filesystem as read-only. If your application writes anywhere inside the container filesystem (temp files, log files, PID files, JVM crash dumps), it will fail.
Common offenders:
- Log files written to a path like /app/logs/
- The JVM's -XX:+HeapDumpOnOutOfMemoryError writing to the working directory
- Applications writing temp files to /tmp
Solutions:
- Write logs to stdout/stderr, not files (let the orchestrator handle log collection)
- Mount a writable volume for any path that needs writes: /tmp, /app/logs, etc.
- Configure JVM heap dumps to a mounted volume path
In your Kubernetes deployment:
securityContext:
  readOnlyRootFilesystem: true
volumeMounts:
  - name: tmp
    mountPath: /tmp
  - name: logs
    mountPath: /app/logs
volumes:
  - name: tmp
    emptyDir: {}
  - name: logs
    emptyDir: {}
Test this locally:
docker run --read-only --tmpfs /tmp your-image:tag
If the application starts cleanly under --read-only, it is very likely to run with readOnlyRootFilesystem: true in Kubernetes as well.
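To mirror the volume mounts in the deployment snippet above, add a writable tmpfs for each path the application writes to (the paths here just match that example):
# Read-only root filesystem plus a writable tmpfs per writable path,
# matching the emptyDir mounts in the Kubernetes example above
docker run --rm --read-only --tmpfs /tmp --tmpfs /app/logs your-image:tag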
Networking: localhost means something different
In Docker Compose, services reach each other by service name. postgresql://postgres:5432/mydb works because Compose creates a network and registers DNS for service names. Your application assumes the same pattern in production.
In production (Kubernetes, ECS), the networking model is different: services are reached via cluster DNS (myservice.namespace.svc.cluster.local) or environment-injected service endpoints, not Compose service names.
The fix is ensuring your application's service endpoints are fully configurable via environment variables and that local defaults don't leak into production configs. Never hardcode localhost or Compose service names in application code. Everything that varies between environments goes into environment variables.
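A minimal sketch of the idea, with DATABASE_URL and the hostnames as illustrative placeholders: the image stays identical, and only the injected value changes between environments:
# Local development: the value points at the Compose service name
docker run --rm -e DATABASE_URL="postgresql://postgres:5432/mydb" your-image:tag
# Production: the orchestrator injects the real endpoint instead,
# for example a cluster DNS name
docker run --rm -e DATABASE_URL="postgresql://mydb.prod.svc.cluster.local:5432/mydb" your-image:tag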
Close the gap intentionally
Add a docker run test to your CI pipeline that mimics production constraints before the image is pushed:
docker run \
  --read-only \
  --tmpfs /tmp \
  --memory=512m \
  --cpus=0.5 \
  --user 1001:1001 \
  --env-file .env.test \
  --platform linux/amd64 \
  your-image:tag \
  /bin/sh -c "echo 'startup check passed'"
This catches the most common environment mismatches before the image reaches a real environment, and swapping the echo for your application's real startup command (plus a quick probe of its health endpoint) catches even more. Not everything, but enough to stop the "works locally, fails in production" class of incidents before they happen.