What Happens When Your Cache and Your Database Disagree
by Eric Hanson, Backend Developer at Clean Systems Consulting
When Two Sources of Truth Diverge
A user updates their shipping address. Your application writes the new address to the database and deletes the cache entry. Half a second later, another request comes in. The cache miss triggers a database read, but due to replication lag, the read replica that serves the query still has the old address, and the cache is repopulated with that old value. For the next five minutes, until the TTL expires, every request sees the old address even though the database primary has the correct one.
This is not a hypothetical failure mode. It is a specific, predictable consequence of using read replicas in combination with a write-invalidate caching pattern. Understanding the mechanics of how cache and database diverge tells you how to structure your reads and writes to minimize the window.
The Four Ways Divergence Happens
Replication lag on invalidation. You invalidate the cache after a write, but the subsequent cache miss reads from a replica that has not yet received the write. The cache is repopulated with stale data from the replica. This is the scenario above.
Mitigation: on cache miss after a write, read from the primary. A simple way to implement this is to set a short flag in the user's session or request context indicating "this user just wrote data — route this read to primary." Alternatively, use a short TTL after write invalidation that forces a primary read on the next miss.
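One minimal sketch of that "route to primary" flag, assuming an in-process dict as a stand-in for real session or request-context state, and treating `primary` and `replica` as whatever client objects your database driver provides:

```python
import time

# How long a user's reads stay pinned to the primary after a write.
# Should comfortably exceed your worst typical replication lag.
RECENT_WRITE_TTL = 5.0  # seconds

# Stand-in for per-user session state: user_id -> timestamp of last write.
_recent_writes = {}

def mark_wrote(user_id):
    """Record that this user just wrote data."""
    _recent_writes[user_id] = time.monotonic()

def pick_datastore(user_id, primary, replica):
    """Route to the primary if this user wrote within the last few seconds."""
    last = _recent_writes.get(user_id)
    if last is not None and time.monotonic() - last < RECENT_WRITE_TTL:
        return primary
    return replica
```

Call `mark_wrote(user_id)` after every write, and `pick_datastore(...)` on every cache miss; reads that don't follow a recent write still go to the replica.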
Failed invalidation. The database write succeeds. The cache invalidation fails — Redis is momentarily unavailable, the network drops, the process crashes. The stale cache entry persists until TTL.
Mitigation: keep TTLs short enough that the backstop is meaningful. Accept that write-through invalidation without a TTL backstop is not a reliable strategy.
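A sketch of a write path built around that backstop, with hypothetical `db` and `cache` clients: the DELETE is best-effort, and every read-through entry carries a TTL, so a failed invalidation is stale for at most `CACHE_TTL` seconds.

```python
# Maximum staleness window if an invalidation fails.
CACHE_TTL = 30  # seconds

def update_user(db, cache, user_id, data):
    db.update(user_id, data)              # source of truth first
    try:
        cache.delete(f"user:{user_id}")   # best-effort invalidation
    except ConnectionError:
        pass  # acceptable: the TTL bounds how long the stale entry lives

def get_user(db, cache, user_id):
    key = f"user:{user_id}"
    value = cache.get(key)
    if value is None:
        value = db.get(user_id)
        cache.set(key, value, ex=CACHE_TTL)  # TTL backstop on every entry
    return value
```

The point of the `try`/`except` is not to hide the failure (log it in practice) but to make the write path's correctness independent of the invalidation succeeding.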
Race between two concurrent writes. Writer A updates the record and deletes the cache. Writer B updates the record and deletes the cache. A subsequent read-through then repopulates the cache from a replica, with a value that may predate Writer B's change, or both changes, depending on replication lag.
# Timeline of a write-write race:
t=0ms: Writer A: UPDATE users SET name = 'Alice' WHERE id = 1
t=1ms: Writer A: DELETE cache["user:1"]
t=2ms: Writer B: UPDATE users SET name = 'Alicia' WHERE id = 1
t=3ms: Writer B: DELETE cache["user:1"]
t=4ms: Reader: cache miss -- read from replica
       (replica lag = 20ms; replica still has the original value "Alex")
t=4ms: cache["user:1"] = "Alex" <- both "Alice" and "Alicia" are now wrong
This scenario requires either optimistic locking with version numbers or accepting the eventual consistency window and using TTL to bound it.
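One way to sketch the version-number approach on the cache-fill side: store (version, value) pairs and refuse to let an older read overwrite a newer one. In a real Redis deployment the compare would need to be atomic (for example, inside a Lua script); the in-process dict here is only a stand-in to show the logic.

```python
def fill_cache(cache_store, key, version, value):
    """Populate the cache only if `version` is at least as new as what's there.

    `cache_store` is a plain dict standing in for a real cache client.
    Returns True if the entry was written, False if a newer one was kept.
    """
    current = cache_store.get(key)
    if current is not None and current[0] > version:
        return False  # a newer write already filled the cache; keep it
    cache_store[key] = (version, value)
    return True
```

With this check, the timeline above ends differently: the read-through carrying the replica's stale row (at an older version) is rejected once either writer's fill, tagged with a newer version, has landed.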
Manual database changes. A developer runs an UPDATE directly on the database to fix a data issue. The cache is not invalidated, so it keeps serving the pre-fix value. This is the most common cause of "but I fixed it in the database, why is the application still showing the old value?"
Mitigation: make manual database fixes include a corresponding cache invalidation step. For critical data, use an internal admin API that handles both operations atomically rather than direct database access.
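A minimal sketch of such an admin helper, with hypothetical `db` and `cache` clients, pairing the fix and the invalidation in one code path so neither can be forgotten:

```python
def admin_fix_user(db, cache, user_id, data):
    """Apply a manual data fix and invalidate the cache in one step.

    Not truly atomic across two systems -- if the process dies between the
    two calls, the TTL backstop still applies -- but it removes the human
    step where the invalidation gets forgotten.
    """
    db.update(user_id, data)
    cache.delete(f"user:{user_id}")  # never ship the fix without this
```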
Patterns That Reduce Divergence Risk
Read-your-writes consistency. Route the read that immediately follows a write to the same primary the write went to. This eliminates the replication lag window for the writing user.
Versioned cache keys. Instead of invalidating a cache entry, write a new entry with a new version key and update a pointer to the current version. Readers always look up the current version pointer, then fetch the versioned entry.
# Versioned cache key pattern:
def get_user(user_id):
    version = cache.get(f"user:{user_id}:version") or 1
    return cache.get(f"user:{user_id}:v{version}")

def update_user(user_id, data):
    db.update(user_id, data)
    new_version = db.get_version(user_id)
    cache.set(f"user:{user_id}:v{new_version}", data, ex=300)
    cache.set(f"user:{user_id}:version", new_version, ex=300)
    # Old versioned entries naturally expire via their TTL
Short TTL as the consistency bound. Accept eventual consistency with a defined bound. A 30-second TTL means divergence resolves within 30 seconds. This is an explicit choice, not a failure — document it, ensure stakeholders understand it, and verify it is acceptable for the data type.
The goal is not eliminating divergence — that requires transactions across two systems, which is impractical. The goal is understanding the divergence window, bounding it, and ensuring the system fails safely when it occurs.