The Difference Between Fixing a Bug and Understanding a Bug
by Arif Ikhsanudin, Backend Developer
The Fix That Created the Next Bug
A null pointer exception was appearing intermittently in the order processing service. The stack trace pointed to a line that accessed order.getCustomer().getAddress(). The fix: add a null check before the access.
// Before
String city = order.getCustomer().getAddress().getCity();
// After
String city = order.getCustomer() != null && order.getCustomer().getAddress() != null
    ? order.getCustomer().getAddress().getCity()
    : null;
The NPE stopped appearing. The bug was not fixed — it was silenced. The underlying question — why would an order exist without a customer? — was never asked. Three months later, a data integrity issue was discovered: a batch import job was creating orders without linking them to customers. The silenced NPE had been the only signal. Now there were thousands of orphaned orders in production and no easy way to reconcile them.
This is the difference between fixing a bug and understanding a bug.
Why the Quick Fix Wins
There's significant pressure in most engineering environments to close tickets and move on. A bug fix that resolves the user-visible symptom closes the ticket. A bug investigation that questions a system assumption requires time, may reveal larger problems, and doesn't produce immediate visible output.
The incentive structure points toward the quick fix. The right engineering behavior often points the other way.
The Five-Why Approach to Bugs
Toyota's Five Whys methodology — ask "why" until you reach the root cause — applies directly to software debugging. The discipline is to not stop at the proximate cause.
Using the example above:
- Why did the NPE occur? order.getCustomer() returned null.
- Why was the customer null? Orders can be created without a customer reference.
- Why can orders be created without a customer? The data model doesn't enforce the relationship at the database level — it's nullable.
- Why is it nullable? It was left nullable during a batch import feature that imported historical orders without customer data.
- Why wasn't this addressed after the import? It was intended to be temporary and the follow-up ticket was never picked up.
The root cause is a data integrity constraint that was intentionally relaxed and never restored. The correct fix is to make the batch import link every order to a customer, reconcile the orphaned records, and then restore the non-null constraint in the database so the relationship is enforced again. The NPE was a symptom of a data model decision.
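The database constraint is the primary fix. As a complement, the same rule can be stated in the domain model so that a violation fails where the order is created rather than where it is read. The sketch below is a minimal illustration under that assumption; the Customer and Order classes and their fields are hypothetical, not the service's actual code.

```java
import java.util.Objects;

// Hypothetical domain classes; names and fields are illustrative only.
final class Customer {
    private final String name;

    Customer(String name) {
        this.name = Objects.requireNonNull(name, "name");
    }

    String getName() {
        return name;
    }
}

final class Order {
    private final String id;
    private final Customer customer;

    Order(String id, Customer customer) {
        this.id = Objects.requireNonNull(id, "id");
        // Fail at creation time, where the cause is visible, not at read time.
        this.customer = Objects.requireNonNull(customer, "an order must reference a customer");
    }

    String getId() {
        return id;
    }

    Customer getCustomer() {
        return customer;
    }
}
```

Objects.requireNonNull still throws a NullPointerException, but the stack trace now points at the code that tried to create the invalid order, which is the question the original null check never asked.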
The Categories of Root Cause
Most bugs trace to one of a few root cause categories:
Missing validation: Input that should have been rejected was accepted and propagated to a state where it caused a failure later. The fix is validation at the entry point, not defensive checks throughout the system.
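As a rough sketch of validation at the entry point, a batch import could check each row before anything is written. The ImportRow and ImportRowValidator names and the two fields are assumptions made for illustration; a real import would have its own row format and rules.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shape of one row in an import file; field names are assumptions.
final class ImportRow {
    final String orderId;
    final String customerId;

    ImportRow(String orderId, String customerId) {
        this.orderId = orderId;
        this.customerId = customerId;
    }
}

final class ImportRowValidator {
    // Returns the problems found in a row; an empty list means the row is acceptable.
    static List<String> validate(ImportRow row) {
        List<String> problems = new ArrayList<>();
        if (isMissing(row.orderId)) {
            problems.add("orderId is missing");
        }
        if (isMissing(row.customerId)) {
            problems.add("customerId is missing");
        }
        return problems;
    }

    private static boolean isMissing(String value) {
        return value == null || value.trim().isEmpty();
    }
}
```

Rows with a non-empty problem list can be rejected or quarantined at import time, so the rest of the system does not need defensive checks for data that should never have been stored.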
Violated invariant: The system assumes a property (every order has a customer, every session has a user, every transaction has an amount) that can be violated under certain conditions. The fix is either enforcing the invariant where it should hold, or redesigning the logic that depends on it.
Race condition: Two operations that are individually correct produce incorrect state when they execute concurrently. Null checks don't fix race conditions. Proper synchronization or atomic operations do.
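To illustrate the atomic-operation option, here is a small sketch of a counter that reserves stock with a compare-and-set loop instead of a separate check followed by a decrement. StockCounter is a hypothetical class invented for this example.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stock counter; illustrates an atomic check-and-decrement.
final class StockCounter {
    private final AtomicInteger available;

    StockCounter(int initial) {
        this.available = new AtomicInteger(initial);
    }

    // Atomically reserves one unit; returns false when nothing is left.
    boolean tryReserve() {
        while (true) {
            int current = available.get();
            if (current <= 0) {
                return false;
            }
            if (available.compareAndSet(current, current - 1)) {
                return true;
            }
            // Another thread changed the value between get() and compareAndSet(); retry.
        }
    }

    int available() {
        return available.get();
    }
}
```

A version that checked available > 0 and then decremented in two separate steps could let two threads both pass the check for the last unit. Here the check and the update succeed or fail together.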
Incorrect assumption about external behavior: The service assumes a third-party API returns data in a specific format or responds within a specific time. The fix is either validating the assumption at the integration boundary or designing for its violation.
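Validating at the integration boundary might look like the sketch below, which covers only the format half of the assumption. The payload shape and the ExchangeRateBoundary class are invented for illustration and do not describe any real provider's API.

```java
import java.util.Map;

final class ExchangeRateBoundary {
    // Hypothetical payload shape, e.g. {"rate": "1.08"}; not a real provider's API.
    static double parseRate(Map<String, String> payload) {
        if (payload == null || payload.get("rate") == null) {
            throw new IllegalArgumentException("rate response has no 'rate' field");
        }
        String raw = payload.get("rate").trim();
        double rate;
        try {
            rate = Double.parseDouble(raw);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("rate is not a number: " + raw, e);
        }
        // NaN fails the first comparison; infinity is rejected explicitly.
        if (!(rate > 0) || Double.isInfinite(rate)) {
            throw new IllegalArgumentException("rate is out of range: " + raw);
        }
        return rate;
    }
}
```

The descriptive exception is deliberate: a malformed response is reported at the boundary with the offending value, instead of surfacing later as an unexplained failure deeper in the service.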
Missing edge case: A case that was not tested or considered during implementation. The fix includes both the case handling and the test that would have caught it.
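A small sketch of pairing the case handling with its test, using an empty input as the overlooked case. OrderStats and the plain-Java checks are hypothetical; in a real project the checks would live in whatever test framework the codebase already uses.

```java
import java.util.List;
import java.util.OptionalDouble;

// Hypothetical helper; the edge case is an empty list of order totals.
final class OrderStats {
    // Average order value in cents; empty input yields an empty result.
    static OptionalDouble averageCents(List<Long> totals) {
        if (totals.isEmpty()) {
            return OptionalDouble.empty();
        }
        long sum = 0;
        for (long total : totals) {
            sum += total;
        }
        return OptionalDouble.of((double) sum / totals.size());
    }
}

// Plain-Java checks standing in for a unit test framework.
final class OrderStatsRegressionTest {
    static void emptyInputHasNoAverage() {
        if (OrderStats.averageCents(List.of()).isPresent()) {
            throw new AssertionError("empty input must not produce an average");
        }
    }

    static void averageOfTwoOrders() {
        OptionalDouble average = OrderStats.averageCents(List.of(1000L, 2000L));
        if (!average.isPresent() || average.getAsDouble() != 1500.0) {
            throw new AssertionError("expected an average of 1500.0 cents");
        }
    }
}
```

The first check is the one that would have caught the bug; the second guards the ordinary path so the fix does not break it.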
What Understanding a Bug Produces
An investigation that reaches root cause produces:
- A fix that addresses the cause rather than the symptom
- A test that would have caught the bug (and will catch regressions)
- Possibly a design improvement that prevents the class of bugs
- Documentation of the finding — at minimum, a commit message that explains why the fix was made, not just what it does
This takes longer than a patch. It also avoids the pattern where the same bug appears in slightly different forms repeatedly because the underlying cause was never addressed.
The Practical Takeaway
For your next non-trivial bug fix, before writing any code, write down why the bug occurred at the most fundamental level you can reach with available information. If the answer is "the code was wrong," go one level deeper: why was the code wrong? Missing validation? Wrong assumption? Missing test? Let that answer guide both the fix and the test you write to prevent recurrence. Then check: is there anywhere else in the codebase making the same wrong assumption?