Where Java Applications Lose Time — CPU, I/O, Lock Contention, and GC
by Eric Hanson, Backend Developer at Clean Systems Consulting
The diagnostic mistake that wastes the most time
Optimizing before profiling is the most expensive mistake in performance work. The second most expensive is profiling the wrong thing — adding a connection pool when the bottleneck is CPU, or tuning GC when threads are blocked on a lock. Each category of performance problem has a distinct signature. Identify the category first. Then optimize.
The four categories: CPU-bound (the application does too much computation), I/O-bound (threads spend time waiting for network or disk), lock-contention-bound (threads wait on each other), and GC-bound (garbage collection pauses or overhead dominate). Real applications often have more than one, but one is almost always primary.
Identifying the category
CPU-bound: process CPU usage is high (near 100% of available cores), and latency scales linearly with load. Adding more CPU helps. Reducing computation helps more.
I/O-bound: CPU usage is low despite high request volume. Threads spend most of their time in WAITING or TIMED_WAITING, blocked on network reads, database queries, or file operations. Adding more threads (up to a point) helps. Reducing I/O or parallelizing it helps more.
Lock-contention-bound: CPU usage is low, thread count is high, but throughput is still limited. Threads are BLOCKED — waiting to acquire a monitor held by another thread. Symptom: throughput doesn't improve (or degrades) as you add threads. jstack shows many threads blocked on the same lock.
GC-bound: application throughput or latency degrades periodically in a pattern that correlates with GC cycles. GC logs show long pause times or high GC overhead percentage. CPU usage spikes during pauses.
The diagnostic tool for initial category identification: a thread dump (jstack <pid> or kill -3 <pid>) combined with top -H -p <pid>. Thread states in jstack — RUNNABLE, WAITING, TIMED_WAITING, BLOCKED — tell you where threads are spending time. CPU per-thread in top -H tells you which threads are consuming CPU.
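One wrinkle when correlating the two outputs: top -H prints thread ids in decimal, while jstack reports them in hex as nid=0x.... A trivial conversion bridges them (the tid value here is illustrative):

```java
// top -H shows per-thread CPU with decimal thread ids; jstack tags each
// thread with nid=0x<hex>. Convert the decimal tid to find the hot thread
// in the dump.
public class TidToNid {
    static String toNid(long decimalTid) {
        return "nid=0x" + Long.toHexString(decimalTid);
    }

    public static void main(String[] args) {
        // e.g. a thread consuming 95% CPU in top -H with tid 48123
        System.out.println(toNid(48123)); // search jstack output for this nid
    }
}
```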
CPU-bound: finding hot methods
If threads are RUNNABLE and CPU is high, the work is genuinely CPU-intensive. The question is which code is responsible.
async-profiler is the right tool. It uses AsyncGetCallTrace to sample thread stacks without the safepoint bias that afflicts JVM-based profilers — it captures threads mid-flight, including threads doing native I/O, not just at safepoint pauses:
# Profile CPU for 30 seconds, generate flamegraph
./profiler.sh -d 30 -f flamegraph.html <pid>
The flamegraph visualization shows the widest frames as the hottest code paths. Wide frames at the top edge of the graph are the leaf methods where CPU time is actually spent; the wide frames beneath them show the call paths that lead there.
Common CPU hot spots in Java applications:
Serialization/deserialization. JSON parsing and generation is CPU-intensive at high throughput. Jackson's ObjectMapper is thread-safe once configured but expensive to construct; a common mistake is creating a new one per request instead of sharing a single instance. Benchmark with JMH before switching serializers, but jackson-databind with a shared, configured ObjectMapper is usually fast enough. For extreme throughput, jsoniter or dsl-json are measurably faster.
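The share-one-instance principle is not Jackson-specific. As a standard-library analogue (no Jackson dependency needed here), DateTimeFormatter is likewise immutable, thread-safe, and worth building exactly once:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Same pattern as sharing one configured ObjectMapper: construct the
// expensive, thread-safe object once and reuse it from every thread.
// (Unlike the legacy SimpleDateFormat, which is NOT thread-safe.)
public class SharedFormatter {
    private static final DateTimeFormatter ISO_DATE =
        DateTimeFormatter.ofPattern("yyyy-MM-dd");

    public static String format(LocalDate date) {
        return ISO_DATE.format(date); // no per-call construction cost
    }
}
```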
Regular expressions. Pattern.compile() is expensive. Called per-request with the same pattern, it wastes CPU re-compiling. Compile patterns once and store them as static finals:
// Wrong — compiles the pattern on every call
public boolean isValidEmail(String input) {
    return input.matches("[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+");
}

// Correct — compiled once
private static final Pattern EMAIL =
    Pattern.compile("[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+");

public boolean isValidEmail(String input) {
    return EMAIL.matcher(input).matches();
}
Excessive object allocation. High allocation rate means high GC frequency, which means CPU spent in GC rather than application work. async-profiler's allocation profiling mode (-e alloc) shows which call sites allocate the most bytes. The fix is usually object reuse, primitive collections, or moving computation outside hot loops.
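A minimal sketch of object reuse in a hot path (method names are illustrative): string concatenation with += allocates a fresh String per iteration, while a single StringBuilder grows one backing array:

```java
import java.util.List;

// Illustration of "fix allocation in hot loops": both methods produce the
// same result, but the naive version allocates a new String per iteration.
public class HotLoop {
    static String joinNaive(List<String> parts) {
        String out = "";
        for (String p : parts) {
            out += p; // new String (plus a hidden StringBuilder) each pass
        }
        return out;
    }

    static String joinReused(List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) {
            sb.append(p); // one backing array, grown amortized
        }
        return sb.toString();
    }
}
```

In an allocation flamegraph, the naive version shows up as a wide String/StringBuilder frame under the loop's call site.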
I/O-bound: reducing wait time
If threads are mostly WAITING or TIMED_WAITING and CPU is low, the application is waiting on external systems. The question is which waits dominate and whether they can be parallelized.
Database queries are the most common I/O bottleneck. The diagnostic: enable slow query logging in the database, or use p99 latency per query in your APM. Queries that show up at the top of total execution time — not just slowest individually, but slowest multiplied by call frequency — are the priority.
N+1 query patterns (common in JPA/Hibernate applications) generate hundreds of queries where one should suffice. hibernate.show_sql=true in development, or a query count assertion in tests, catches these before production.
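A sketch of the fetch-join fix, assuming a hypothetical Order entity with a lazy items collection (the entity and queries are illustrative, not from the source):

```java
// Hypothetical JPQL for an Order entity with a lazy "items" collection.
public class OrderQueries {
    // N+1 pattern: this loads the orders in one query, then each
    // order.getItems() lazily issues another SELECT — N more in total.
    static final String NAIVE = "SELECT o FROM Order o";

    // Fix: one SQL statement joining both tables. DISTINCT collapses the
    // duplicate Order rows the join produces.
    static final String FETCH_JOIN =
        "SELECT DISTINCT o FROM Order o JOIN FETCH o.items";
}
```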
Connection pool exhaustion is a different I/O problem — the database can handle more load but the application can't issue more queries because all connections are in use. Symptom: threads TIMED_WAITING in pool acquisition code. Diagnosis: HikariCP's pool metrics (active connections, threads awaiting a connection, connection acquisition time). Fix: increase the pool size (up to the database's connection limit) or reduce query latency so connections are returned to the pool faster.
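A hypothetical configuration fragment, using Spring Boot's HikariCP property names (the values are illustrative starting points, not recommendations):

```properties
# Hypothetical HikariCP settings — tune against your database's connection limit
spring.datasource.hikari.maximum-pool-size=20
spring.datasource.hikari.minimum-idle=20
# fail fast instead of queueing indefinitely for a connection (ms)
spring.datasource.hikari.connection-timeout=3000
# expose pool MBeans (active, idle, threads awaiting connection) over JMX
spring.datasource.hikari.register-mbeans=true
```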
External HTTP calls. Synchronous HTTP calls to external services block a thread for their full duration. At high concurrency, this exhausts the thread pool. Options: increase the thread pool (expensive in memory), use async HTTP clients (WebClient, Async HttpClient), or cache responses where freshness requirements allow.
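A minimal sketch of the async option using the JDK's built-in client (java.net.http, JDK 11+); the URL and method names are illustrative:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;

// sendAsync returns immediately with a CompletableFuture instead of
// blocking the calling thread for the full duration of the HTTP call.
public class AsyncCall {
    // the client is thread-safe; share one instance
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    public static CompletableFuture<String> fetch(String url) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return CLIENT.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                     .thenApply(HttpResponse::body); // runs on the client's executor
    }
}
```

The calling thread is free as soon as fetch returns; composition (thenApply, thenCombine) replaces blocking waits.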
Lock contention: finding blocked threads
Lock contention is the subtlest performance category. CPU is low, I/O seems fine, but throughput is capped. Threads are blocked waiting for locks held by other threads.
jstack is the first diagnostic tool. Take several thread dumps a few seconds apart. Threads consistently in BLOCKED state waiting on the same monitor are the contention point. The thread holding the lock is identifiable in the dump — look for - locked <0x...> in the owning thread's stack.
async-profiler in lock profiling mode:
./profiler.sh -d 30 -e lock -f lock_flamegraph.html <pid>
This samples lock contention events and produces a flamegraph of which code paths are waiting on which locks.
Common contention sources in Java applications:
synchronized on a shared object. A frequently-called method on a shared singleton that synchronizes on this serializes all callers. Replace with java.util.concurrent types: ConcurrentHashMap instead of synchronized HashMap, AtomicLong instead of synchronized increment, ReadWriteLock when reads dominate writes.
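A sketch of the replacement pattern — a shared per-key counter with no global lock (class and method names are hypothetical):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Instead of synchronized (this) around a HashMap<String, Long>:
// ConcurrentHashMap stripes its updates internally, and LongAdder spreads
// increments across cells under contention, so writers rarely collide.
public class HitCounter {
    private final ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();

    public void hit(String key) {
        counts.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    public long get(String key) {
        LongAdder adder = counts.get(key);
        return adder == null ? 0L : adder.sum();
    }
}
```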
HttpClient or connection pool shared across threads without sufficient concurrency. Connection pools are internally synchronized. If the pool's maximum connections is lower than the request concurrency, threads queue for connections — a contention pattern that looks like I/O but is actually lock-based.
Logging frameworks. Some logging configurations synchronize on the appender. Under high log volume, log calls become a contention point. Use async appenders (Logback's AsyncAppender, Log4j2's async logging) for any appender that does I/O.
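A hypothetical logback.xml illustrating the pattern — the blocking file appender is wrapped in an AsyncAppender so the request thread only enqueues the event:

```xml
<!-- Sketch: request threads append to an in-memory queue; a background
     thread drains it to the file appender. Values are illustrative. -->
<configuration>
  <appender name="FILE" class="ch.qos.logback.core.FileAppender">
    <file>app.log</file>
    <encoder><pattern>%d %-5level %logger{36} - %msg%n</pattern></encoder>
  </appender>
  <appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
    <appender-ref ref="FILE"/>
    <queueSize>8192</queueSize>                   <!-- default is 256 -->
    <discardingThreshold>0</discardingThreshold>  <!-- 0 = never drop events -->
  </appender>
  <root level="INFO">
    <appender-ref ref="ASYNC"/>
  </root>
</configuration>
```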
GC-bound: tuning for throughput or latency
GC overhead is measurable. The GC logs article covered tuning in detail; the diagnostic addition here is distinguishing GC overhead from application throughput problems.
The key metric: GC overhead percentage — the fraction of wall clock time spent in GC pause. JVM exposes this:
jstat -gcutil <pid> 1000
The GCT column is total GC time in seconds. Divide by elapsed time for the percentage. Above 5% sustained is a problem. Above 10% is severe; with the parallel collector's default -XX:+UseGCOverheadLimit, the JVM throws OutOfMemoryError: GC overhead limit exceeded when more than 98% of time goes to GC and less than 2% of the heap is recovered.
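The same fraction can be computed from inside the process via the standard management beans — a sketch equivalent to dividing jstat's GCT by elapsed time:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// GC overhead = accumulated collection time across all collectors,
// divided by JVM uptime. Both values are in milliseconds.
public class GcOverhead {
    public static double fraction() {
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector doesn't report it
            if (t > 0) {
                gcMillis += t;
            }
        }
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptimeMillis == 0 ? 0.0 : (double) gcMillis / uptimeMillis;
    }
}
```

Exporting this as a gauge to your metrics system makes the "above 5% sustained" threshold alertable.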
Allocation rate matters more than heap size for GC frequency. A smaller heap with a low allocation rate GCs rarely. A large heap with a high allocation rate GCs constantly. Profile allocation with async-profiler before adding heap.
Promotion failure is the most disruptive GC event. When a young collection cannot promote surviving objects because the old generation lacks space, the JVM falls back to a stop-the-world full GC — potentially seconds of pause. The fix is preventing premature promotion: increase young generation size so short-lived objects die before they are promoted, reduce the allocation rate of long-lived objects, or switch to a concurrent collector such as ZGC, which does not fall back to long stop-the-world collections.
The full diagnostic flow
A systematic approach rather than jumping to the most familiar tool:
1. Measure baseline: latency p50/p99/p999, throughput, CPU%, memory RSS
2. Take thread dump: what states are threads in? RUNNABLE / WAITING / BLOCKED?
3. If RUNNABLE + high CPU → async-profiler CPU flamegraph
4. If WAITING + low CPU → identify what they're waiting on (DB? HTTP? pool?)
5. If BLOCKED + low CPU → async-profiler lock flamegraph, jstack for contention point
6. If periodic latency spikes → check GC logs, jstat -gcutil
7. Fix the primary bottleneck. Re-measure. Repeat.
The re-measure step is not optional. Fixing one bottleneck reveals the next. An application that was I/O-bound may become CPU-bound after query optimization frees threads to do more computation. The category can shift with each fix.
The goal is not eliminating all overhead — it's identifying which overhead is primary and addressing that one first. Everything else is premature optimization.